IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34

Transfer US Census data releases and publish in production per release schedule (ongoing effort) #218

Closed: landreev closed this issue 8 months ago

landreev commented 1 year ago

Will add details here. There's an ongoing discussion of what's involved in the dedicated Slack channel and a Google Doc.

landreev commented 1 year ago

Transferred the first 20GB data sample as a test and added it to the draft dataset in the new Census collection in prod.: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1OR2A6&version=DRAFT. Some of the more sensitive details can be found in the email thread.

landreev commented 1 year ago

Working now on the next stage of the project: attempting a bucket-to-bucket transfer test of a large-ish (~500GB) chunk of data, to estimate how much time the main, TB-sized transfer is going to take. To run a bucket-to-bucket copy via the AWS API (basically, aws s3 cp s3://sourcebucket/xxx s3://destinationbucket/yyy) you need a single AWS role that has access to both buckets: read on the source and write on the destination. (I.e., there is no way to authenticate with two different roles for source and destination; I only learned this last week.) So we need an IAM role created in the AWS account that owns the prod. bucket, so that Census can grant it read access to theirs. I cannot do this myself with the AWS CLI, so we need LTS to create it for us. That is the current step; I made a request via lts-prodops.
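For the record, a minimal sketch of the kind of copy in question, assuming one set of credentials that resolves to a role with read on the source and write on the destination (the bucket names and prefixes below are placeholders, not the real ones):

```sh
# One object, copied entirely within S3 (no download to a local node):
aws s3 cp s3://census-source-bucket/release/part-0001.tar \
          s3://dataverse-prod-bucket/census/part-0001.tar

# A whole prefix, e.g. for the ~500GB test chunk:
aws s3 cp --recursive \
          s3://census-source-bucket/release/ \
          s3://dataverse-prod-bucket/census/release/
```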

From Census:

> For the bucket to bucket, the design is to have the recipient (yourself) create the IAM principal, and we would grant the account/principal read access to our bucket. For the larger dataset this is likely going to be critical. We should have taken those steps this time, sorry for the extra work.
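Presumably the grant on the Census side would then look something like a bucket policy naming that role. A sketch, with the account ID, role name, and bucket name all made up for illustration:

```sh
# Attach a read-only grant for the recipient's IAM role to the source bucket.
cat > /tmp/census-read-grant.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowDataverseTransferRoleRead",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:role/dataverse-census-transfer" },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::census-source-bucket",
        "arn:aws:s3:::census-source-bucket/*"
      ]
    }
  ]
}
EOF
aws s3api put-bucket-policy --bucket census-source-bucket \
    --policy file:///tmp/census-read-grant.json
```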

landreev commented 1 year ago

Had the IAM role created and passed it to the Census contact so they can grant it read access to the data source. Can't wait to actually try moving a few hundred GBs between buckets! ("Between buckets" means we will never need to transfer any data through our own AWS nodes, the prod. servers, where the transfer is limited to something like 150MB/s. That is roughly the same rate we get when reading from our own prod. bucket and from the US Census one. I'm really curious to see how fast it is going to be when the copy stays entirely within S3.)
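For a rough sense of what the "through our own nodes" path costs at that rate (back-of-envelope only; 150MB/s is the observed rate, not a guarantee):

```sh
# At ~150 MB/s through a prod. node:
#   500 GB  ≈ 500*1024/150  ≈ 3,400 s ≈ 1 hour, one direction only;
#             a round trip through local disk roughly doubles that.
#   1 TB    ≈ 1024*1024/150 ≈ 7,000 s ≈ 2 hours per direction.
echo "scale=1; (500*1024)/150/60" | bc   # minutes to move 500 GB one way at 150 MB/s
```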

landreev commented 1 year ago

Unfortunately, we haven't been able to run a successful direct transfer today (I haven't been able to gain read access with the local IAM user they tried to grant it to... something on their end, it appears). I'm very much interested in testing this sooner rather than later, and just genuinely curious about the performance. So I'll either make myself available to re-test next week, if/when they make any tweaks on their end, or I will pass the task on to somebody else.
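For the re-test, the quickest sanity check is to confirm which principal our credentials actually resolve to and then probe the source bucket directly; that separates an IAM/bucket-policy problem from a transfer problem (bucket name and prefix are placeholders):

```sh
# Which IAM principal are we actually calling as?
aws sts get-caller-identity

# Can that principal list and read the source bucket?
aws s3 ls s3://census-source-bucket/release/ --summarize
aws s3 cp s3://census-source-bucket/release/part-0001.tar /tmp/part-0001.tar
```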

landreev commented 1 year ago

I don't know if this is a metaphor of sorts for government work in general, but it is increasingly looking like two weeks were spent figuring out how to do this data transfer the "smart" way, only to discover that the smart way was in fact slower than the "dumb", brute-force way that had been available from the get-go.

Specifically, the direct "bucket-to-bucket" transfer that we finally got to work does appear to take longer than the "round trip" method of copying the data from their bucket to our server and then copying it to our bucket. (My best guess is that any potential speed advantage of copying directly was entirely offset by the fact that the source and destination buckets live in two different AWS regions, us-east-2 and us-east-1, respectively.)
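For anyone repeating the comparison, the two approaches amount to roughly the following (bucket names and the scratch path are placeholders; the regions are the real ones, us-east-2 for the source and us-east-1 for our prod. bucket):

```sh
# "Smart" way: direct bucket-to-bucket, cross-region, copy done server-side by S3.
time aws s3 cp --recursive \
    --source-region us-east-2 --region us-east-1 \
    s3://census-source-bucket/release/ \
    s3://dataverse-prod-bucket/census/release/

# "Dumb" way: pull to local disk on a prod. node, then push to our bucket.
time aws s3 cp --recursive --region us-east-2 \
    s3://census-source-bucket/release/ /scratch/census/release/
time aws s3 cp --recursive --region us-east-1 \
    /scratch/census/release/ s3://dataverse-prod-bucket/census/release/
```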

cmbz commented 1 year ago

The plan is to host the latest batch of 26TB of US Census data on NESE tape and make it available via a Globus link from a Harvard Dataverse dataset page.

Current status:

cmbz commented 1 year ago

A meeting has been scheduled for 2023/09/20 with Harvard Library, Harvard Research Computing and Data, and other stakeholders to discuss the proposed model for supporting USCB big data using NERC resources.

cmbz commented 10 months ago

2024/01/03 Update: Tim was informed of the new NESE disk option for large data support on 2023/12/20. We are waiting to hear whether they want to move forward with it.