HumanCellAtlas / ingest-central

Ingest Central is the hub repository for the ingest service
Apache License 2.0

Submit Meyer dataset in Prod #595

Closed mshadbolt closed 4 years ago

mshadbolt commented 4 years ago

As a data wrangler I need to submit the Meyer dataset (https://github.com/HumanCellAtlas/hca-data-wrangling/issues/86) in prod, but given the issues the spreadsheet caused, crashing both staging and integration, I thought I would make a ticket to coordinate between the ingest devs and myself on what needs to be done in order to submit the dataset.

Background:

This dataset has some unique features that may be causing problems with the linking for ingest; notably, sequence files from Bulk RNA and Whole Genome Sequencing are linked straight from specimen from organism, rather than from cell suspension as all experiments up to now have been.

It is not yet known if ingest is struggling with the fact that there are sequencing files linked to different biomaterial types, or because the linking between donors and specimens is complex.

@rdgoite has been working on re-configuring the servers to ensure they don't fall over when they encounter complicated linking, but the ingest team has yet to figure out the exact cause.

The project is now submitted in staging (https://staging.data.humancellatlas.org/explore/projects/bc2229e7-e330-435a-8c2a-4275741f2c2d). It exported the correct number of bundles and the linking appears to be correct. It was not picked up by the staging tracker: https://tracker.staging.data.humancellatlas.org/

Now we need to figure out what needs to be done before I am able to submit to prod to ensure servers don't crash.

mshadbolt commented 4 years ago

Some detective work by Rolando and Rodrey revealed that the spreadsheet had over one million empty rows, which was causing the server to crash. So uploading a spreadsheet without these rows should be okay.
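If it helps for future uploads, below is a minimal sketch for trimming those trailing empty rows before upload; the use of openpyxl and the filenames are assumptions on my part, not the wranglers' actual tooling.

import openpyxl

# Minimal sketch (assumed tooling): drop the trailing all-empty rows from
# every sheet, then save a trimmed copy for upload. Filenames are hypothetical.
wb = openpyxl.load_workbook("meyer_dataset.xlsx")
for ws in wb.worksheets:
    last = ws.max_row
    # Walk upwards until we hit the last row that actually holds a value.
    while last > 1 and all(cell.value is None for cell in ws[last]):
        last -= 1
    if last < ws.max_row:
        ws.delete_rows(last + 1, ws.max_row - last)
wb.save("meyer_dataset_trimmed.xlsx")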

@parthshahva got in contact about this dataset as @hannes-ucsc is concerned that the structure might cause issues for the browser, so I will hold off on trying to submit in prod for now. I am still hopeful I can submit early next week.

mshadbolt commented 4 years ago

@hannes-ucsc did you manage to see whether this dataset will cause issues for azul/data browser?

hannes-ucsc commented 4 years ago

I looked at several bundles in that project and they all have cell suspensions linked to sequence files via a sequencing process. This doesn't match the description above:

Sequence files from Bulk RNA and Whole Genome Sequencing are linked straight from specimen from organism, rather than from cell suspension as all experiments up to now have been

What am I missing?

hannes-ucsc commented 4 years ago

Can someone check the file counts on the DB for that project? What's the expected bundle count? Is there a mix of bundles, like some with the weird linking and some without? If so, what's a bundle FQID for such a bundle?

[Screenshot from 2019-10-22 23-31-32]

mshadbolt commented 4 years ago

The counts there seem correct, apart from the file count, but I guess that is some combination of the number of json files we create and duplicating files in bundles.

Here is the list of bundles that don't have cell suspensions: bundles_no_cell_suspension.txt

example of one: curl -X GET "https://dss.staging.data.humancellatlas.org/v1/bundles/041e4dc6-181b-48f4-9e13-0cf428ecce09?replica=aws&per_page=500" -H "accept: application/json"
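For anyone wanting to reproduce that check, a rough sketch follows; the DSS response shape and the use of the requests library are assumptions, and this is not necessarily how the attached list was produced.

import requests

DSS = "https://dss.staging.data.humancellatlas.org/v1"

def has_cell_suspension(bundle_uuid: str) -> bool:
    # Fetch the bundle manifest and look for a cell_suspension metadata file.
    resp = requests.get(f"{DSS}/bundles/{bundle_uuid}",
                        params={"replica": "aws", "per_page": 500})
    resp.raise_for_status()
    files = resp.json()["bundle"]["files"]
    return any(f["name"].startswith("cell_suspension") for f in files)

print(has_cell_suspension("041e4dc6-181b-48f4-9e13-0cf428ecce09"))  # expect False for this bundle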

hannes-ucsc commented 4 years ago

The counts there seem correct, apart from the file count, but I guess that is some combination of the number of json files we create and duplicating files in bundles.

DB displays the number of data files. It should represent reality, i.e. how many files were actually submitted. Is there a way to come up with a matching number on Ingest's side?

Here is the list of bundles that don't have cell suspensions: bundles_no_cell_suspension.txt

All of those bundles were indexed.

https://dss.staging.data.humancellatlas.org/v1/bundles/041e4dc6-181b-48f4-9e13-0cf428ecce09?replica=aws&per_page=500

The specimen in that bundle (biomaterial_id 367C72hOesophagusBulkRNA) shows up in the DB's samples tab. I think that's a strong indicator that all bundles of that shape were indexed correctly the same way.

mshadbolt commented 4 years ago

DB displays the number of data files. It should represent reality, i.e. how many files were actually submitted. Is there a way to come up with a matching number on Ingest's side?

I personally uploaded 370 files for the primary submission, which were fastqs and protocol documents, so I guess the extras are the files that the analysis pipelines generate, including bams and the qc metrics etc. I'm not sure if there is a way of querying that from ingest.

Looking at the manifest, all the analysed bundles get an extra 7 files generated, but I'm not sure which ones azul counts or what is considered a 'data file'.
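One rough way to get a comparable number from the bundle side is sketched below; it assumes the same DSS response shape as the curl example above, that metadata JSON entries carry indexed=true while data files do not, and a hypothetical bundle_uuids.txt listing the project's bundles.

from collections import Counter
import requests

DSS = "https://dss.staging.data.humancellatlas.org/v1"
counts = Counter()

with open("bundle_uuids.txt") as fh:  # hypothetical: one bundle UUID per line
    for bundle_uuid in (line.strip() for line in fh if line.strip()):
        resp = requests.get(f"{DSS}/bundles/{bundle_uuid}",
                            params={"replica": "aws", "per_page": 500})
        resp.raise_for_status()
        for f in resp.json()["bundle"]["files"]:
            if not f["indexed"]:  # assumed: data files are the non-indexed entries
                counts[f["name"].rsplit(".", 1)[-1].lower()] += 1

print(counts)  # per-extension tally of data files; duplicates across bundles are counted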

hannes-ucsc commented 4 years ago

Using the file type facet in the data browser we can ascertain that there are 361 FASTQs, 7 PDFs and one DOC. The FASTQs are what you uploaded. Looking at the 7 PDFs and one DOC, they appear to be the protocol documentation you are referring to. There are 67 CSVs and, looking at their file names, all but one are submitted by analysis. The one named Tissue_Stability_CBTM_donor_information.csv is probably the one file missing (361+7+1+1 = 370). According to the tracker, Analysis didn't process all bundles, so I think it would be premature to research whether we indexed the secondary bundles. But again, we have no failures for that project in staging, so I am going to call this one confirmed to be working.

[edit: I had omitted one DOC and the counts didn't add up, fixed now]

mshadbolt commented 4 years ago

@HumanCellAtlas/ingest I will be attempting to submit this in prod today.

I have started with trying to submit in staging.

The submission is here: https://ui.ingest.staging.data.humancellatlas.org/submissions/detail/5dc3d80ec338f50008eccc08/overview

The first issue is that one of the library_preparation_protocols is failing to validate (https://api.ingest.staging.data.humancellatlas.org/protocols/5dc3d813c338f50008ecccf7).

Can someone take a look at why it is stuck in 'validating' status?

Thanks

mshadbolt commented 4 years ago

@aaclan-ebi as discussed earlier, I accidentally uploaded one file that is not in the spreadsheet; are you able to please delete this file: https://api.ingest.staging.data.humancellatlas.org/files/5dc3d994c338f50008eccfd5

mshadbolt commented 4 years ago

Everything is now valid, so I will submit in staging.

mshadbolt commented 4 years ago

The submission seemed to work successfully, so I will now upload the spreadsheet and data files to prod.

aaclan-ebi commented 4 years ago

Just to note what happened. The extra file (https://api.ingest.staging.data.humancellatlas.org/files/5dc3d994c338f50008eccfd5) was manually deleted by setting the file to validating > valid (so the state tracker would be notified) and then deleting it.

Looks like there's an issue with ontology validation when the ontology value can't be found. We released ontology service version 1.0.11 in staging and prod and redeployed the validator (to clear the cache) so that ontology validation would work. After that we retriggered the ontology validation by setting the protocol metadata back to Draft state. It's now valid: https://api.ingest.staging.data.humancellatlas.org/protocols/5dc3d813c338f50008ecccf7

mshadbolt commented 4 years ago

I have begun submitting to prod here: https://ui.ingest.data.humancellatlas.org/submissions/detail/5dc4081771fe4a0008e54859/overview

I created a tombstoning ticket for the old submission here: https://github.com/HumanCellAtlas/data-store/issues/2567

@MightyAx as I will be away for the next few days, would it be possible for you to keep an eye on when the project is tombstoned and proceed with the uuid swap and submission?

The current uuid that we want to maintain is: c4077b3c-5c98-4d26-a614-246d12c2e5d7

MightyAx commented 4 years ago

Reingestion via New Project Document

mshadbolt commented 4 years ago

Confirming that all files are valid in prod, so this will be ready to submit once the old project is tombstoned and the uuid is swapped. Thanks very muchly in advance to @MightyAx and @aaclan-ebi for your help with getting this one through.

jahilton commented 4 years ago

@mshadbolt what is the staging project uuid? I see a few different possibilities

MightyAx commented 4 years ago

@jahilton Marion is now Out Of Office. She submitted to staging today so this project must be the one: Project 259f9041-b72f-45ce-894d-b645add2e620, Submission 671a0817-d5d3-4ec0-a730-70b76c13581d

[edit: it helps to include the UUID in the link label, not just the link URL itself (@hannes-ucsc )]

ESapenaVentura commented 4 years ago

Given that both submissions are equal (same number of bundles generated) and that bundles were generated at around 12pm (e.g. https://dss.staging.data.humancellatlas.org/v1/bundles/d27901c6-f6ac-4b39-a1fd-a1fb49b507d1/?replica=aws&version=2019-11-07T115502.445801Z), I would give all my pennies to the submission that @MightyAx is pointing to.

jlzamanian commented 4 years ago

@mshadbolt @hannes-ucsc

In the staging submission, one of the bundles could not be indexed by Azul (see tracker). On the browser page, 359 rather than 361 fastq files are present, and 367 total files rather than 370(?).

We need to make sure this won't happen for the prod submission.

MightyAx commented 4 years ago

pinging @ESapenaVentura, as Marion is now on her way to HCA Asia.

ESapenaVentura commented 4 years ago

I have no idea what is happening here. Do we know which bundle is missing? That might shed some light

jlzamanian commented 4 years ago

From my manual search, I think it is this bundle 4c8aab19-9a12-4a77-ab7b-8a93bb109b76 that was not indexed.

https://dss.staging.data.humancellatlas.org/v1/bundles/4c8aab19-9a12-4a77-ab7b-8a93bb109b76/?replica=aws&version=2019-11-07T115502.447632Z

https://api.ingest.staging.data.humancellatlas.org/bundleManifests/5dc40662c338f50008ecd80d

hannes-ucsc commented 4 years ago

Azul never got a notification for bundle 4c8aab19-9a12-4a77-ab7b-8a93bb109b76. This could be related to a problem in DSS that I recently brought up with them. The signature traceback for this problem is:

[ERROR] 2019-11-07T11:56:43.43Z b6de2a13-1216-5673-bcec-7c2d1ab45754    Error occurred while processing subscription 5152c2b5-c866-4cd3-aa0e-aec87cb88b4d for bundle 4c8aab19-9a12-4a77-ab7b-8a93bb109b76.2019-11-07T115502.447632Z.
Traceback (most recent call last):
  File "/opt/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/opt/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    httplib_response = conn.getresponse()
  File "/var/lang/lib/python3.6/http/client.py", line 1346, in getresponse
    response.begin()
  File "/var/lang/lib/python3.6/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/var/lang/lib/python3.6/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/var/lang/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/var/lang/lib/python3.6/ssl.py", line 1012, in recv_into
    return self.read(nbytes, buffer)
  File "/var/lang/lib/python3.6/ssl.py", line 874, in read
    return self._sslobj.read(len, buffer)
  File "/var/lang/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/python/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/opt/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/opt/python/lib/python3.6/site-packages/urllib3/util/retry.py", line 400, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/opt/python/lib/python3.6/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/opt/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/opt/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 423, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/opt/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 331, in _raise_timeout
    self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: ChunkingHTTPSConnectionPool(host='search-dss-index-staging-bobpbiduntwlsh2yllwchsiypy.us-east-1.es.amazonaws.com', port=443): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/python/lib/python3.6/site-packages/elasticsearch/connection/http_requests.py", line 76, in perform_request
    response = self.session.send(prepared_request, **send_kwargs)
  File "/opt/python/lib/python3.6/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/opt/python/lib/python3.6/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: ChunkingHTTPSConnectionPool(host='search-dss-index-staging-bobpbiduntwlsh2yllwchsiypy.us-east-1.es.amazonaws.com', port=443): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/task/domovoilib/dss/index/es/backend.py", line 82, in _notify_subscribers
    subscription = self._get_subscription(bundle, subscription_id)
  File "/var/task/domovoilib/dss/index/es/backend.py", line 99, in _get_subscription
    body=subscription_query)
  File "/opt/python/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 73, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/opt/python/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 632, in search
    doc_type, '_search'), params=params, body=body)
  File "/opt/python/lib/python3.6/site-packages/elasticsearch/transport.py", line 312, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/opt/python/lib/python3.6/site-packages/elasticsearch/connection/http_requests.py", line 84, in perform_request
    raise ConnectionTimeout('TIMEOUT', str(e), e)
elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeout(ChunkingHTTPSConnectionPool(host='search-dss-index-staging-bobpbiduntwlsh2yllwchsiypy.us-east-1.es.amazonaws.com', port=443): Read timed out. (read timeout=10))
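Reading the traceback, the Elasticsearch query that looks up the subscription hits the 10-second read timeout, so the notification is dropped before it is ever sent. Purely as an illustration (this is not the DSS fix, which belongs in the DSS ticket), a client configured to tolerate slow searches might look like this:

from elasticsearch import Elasticsearch

# Illustrative sketch only: raise the read timeout seen in the traceback and
# retry transient timeouts rather than failing the subscription lookup outright.
es = Elasticsearch(
    hosts=[{"host": "search-dss-index-staging-bobpbiduntwlsh2yllwchsiypy.us-east-1.es.amazonaws.com",
            "port": 443}],
    use_ssl=True,
    timeout=30,             # the traceback shows read timeout=10
    retry_on_timeout=True,
    max_retries=3,
)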

BTW: Who is @jlzamanian ? There is no real name on the GH profile.

jlzamanian commented 4 years ago

BTW: Who is @jlzamanian ? There is no real name on the GH profile.

Jennifer Zamanian, Stanford Data Operations. I've added my name to my GH profile.

hannes-ucsc commented 4 years ago

Duh! Sorry, Jennifer.

hannes-ucsc commented 4 years ago

Looping in @DailyDreaming from the DSS. Lon, could you link this to the ticket under which you are tracking the notification loss issue?

mshadbolt commented 4 years ago

@hannes-ucsc @DailyDreaming Is the issue being discussed something that is intermittent? Does it occur in prod as well as staging? Just trying to figure out whether this is something that will definitely occur if I try the submission in prod or not. Also, can I submit in prod anyway and have it fixed manually by azul/DSS afterwards?

mshadbolt commented 4 years ago

@MightyAx it looks like the project was tombstoned, so it would be great if you could do the uuid swap and submit as soon as possible, as the project has now disappeared from the browser and I would like to see it back up.

MightyAx commented 4 years ago

Re-ingestion Notes

Target Project UUID: c4077b3c-5c98-4d26-a614-246d12c2e5d7
Target Binary: BinData(3,"Jk2YXDx7B8TX5cISbSQUpg==") (see the sketch below)

Old submission: 02e89f20-84c8-4daa-aaeb-80f4a85733ff
Old project id: 5cdc5ab7d96dad000859cec1
Old project replacement uuid: 2f406bf2-b2b5-4f2c-a009-feb4686fc4f0
Bundle manifests updated: 21

New submission: fd52efcc-6924-4c8a-b68c-a299aea1d80f
New project id: 5dc4081c71fe4a0008e5485b
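For anyone double-checking the swap, the Target Binary above appears to be the target project UUID in Mongo's legacy subtype-3 encoding (each 8-byte half of the UUID stored in reversed byte order); a small sketch reproducing it:

import base64
import uuid

target = uuid.UUID("c4077b3c-5c98-4d26-a614-246d12c2e5d7")
raw = target.bytes
# Legacy (subtype 3) encoding: reverse each 8-byte half of the UUID.
legacy = raw[7::-1] + raw[:7:-1]
print(base64.b64encode(legacy).decode())  # -> Jk2YXDx7B8TX5cISbSQUpg==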

lauraclarke commented 4 years ago

@mshadbolt @MightyAx it would be good to understand why the tombstoning happened significantly before the UUID redirect so we can avoid this scenario happening again.

MightyAx commented 4 years ago

The new submission fd52efcc-6924-4c8a-b68c-a299aea1d80f has had its project UUID replaced with the intended one and has been submitted; it is currently processing.

mshadbolt commented 4 years ago

@lauraclarke the main reason for this was that dataops put a halt to tombstoning due to the azul indexing error above. By the time they gave the go-ahead and the tombstoning was complete it was already the weekend in the UK, which meant the project was unavailable all weekend.

I agree this isn't ideal and would advocate for at least some kind of placeholder page in between. Ideally that would look like the existing project page just without the data, but I don't know how difficult that would be.

Given that the tombstoning is done by the data store team in California and the uuid swap needs to be done by an ingest dev at EBI, there is always likely to be some kind of gap, but I agree it would be better to have this coordinated somehow.

lauraclarke commented 4 years ago

Thanks for the summary @mshadbolt. It sounds like pre-planning handovers and figuring out whether there are sensible placeholders before we do this again would be good.

jlzamanian commented 4 years ago

Sorry about this. I was trying to coordinate so that the tombstoning would happen early this week, but there was a miscommunication. Pre-planning would make things go more smoothly.