Submissions with large numbers of studies need to be supported

jdhayhurst commented 4 years ago

Currently we can only support submissions that contain up to a few hundred studies.

Identify where the bottlenecks are in the whole process:
- in the submission process there are synchronous processes that will time out after a certain point
- downstream processing of the summary stats file: validation, harmonisation, file hosting, data access
- importing to the curation app
- presenting data on the GWAS search UI

Once identified, come up with some solutions/tickets to address these bottlenecks.

tudorgroza commented 3 years ago

The actual validation of the submission has been successfully tested as per https://github.com/EBISPOT/goci/issues/275. The handover to the sumstats service is now async - so that shouldn't be an issue. I'll start looking into the import part.

tudorgroza commented 3 years ago

@ljwh2 @eks-ebi - quick question on the import into the Curation App (which might also be connected to https://github.com/EBISPOT/goci/issues/102): are we expecting curators/users to add associations for these large submissions? In particular ... are we expecting a large number of associations (> ~1,000) to be ingested at any point in time with any particular submission?

Reason for asking: importing associations relies on one of the Ensembl APIs. This leads to two challenges:

- Because we continuously hit the API during import ... after a certain number of requests (in a short time span) we get rejected. Standard defense mechanism for any properly implemented API. We can fix it by introducing delays.
- On average, one association takes about 1s to process (validate, etc - including the call to Ensembl). This leads to about 15min for 1,000 associations - with no artificial delay introduced to avoid throttling. With delay ... it could double. Hence, for 10,000 associations we end up with roughly 300 min - i.e., 5h ... only for the associations part of one submission. If anything goes wrong in this time - we're left with a half-baked import.
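
To make the two points above concrete, here is a minimal Python sketch (not the Curation App code). `estimate_import_minutes` reproduces the back-of-the-envelope timing, and `call_with_backoff` shows one way of introducing delays when the API rejects a request. The `RateLimited` exception, the function names and all default values are assumptions for illustration, and exponential backoff is just one option; the thread only mentions "introducing delays".

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error a client would raise when the API rejects a request."""


def estimate_import_minutes(n_associations: int,
                            seconds_per_association: float = 1.0,
                            delay_seconds: float = 0.0) -> float:
    """Minutes needed to import associations one after another."""
    return n_associations * (seconds_per_association + delay_seconds) / 60.0


def call_with_backoff(request: Callable[[], T],
                      max_retries: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `request`; after each rejection wait base_delay * 2**attempt."""
    for attempt in range(max_retries):
        try:
            return request()
        except RateLimited:
            sleep(base_delay * 2 ** attempt)
    # Final attempt: a rejection here propagates to the caller.
    return request()
```

With 1s per association this gives about 17 min for 1,000 associations, and with an extra 1s delay about 333 min for 10,000, which is in line with the rounded 15 min and 300 min quoted above.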

Long story short - if we're expecting a very large number of associations - we need to rewrite the import process from scratch (practically to turn it inside-out). Alternatively, we can look into moving some of the steps done during import into the Deposition App - to be performed during validation.

eks-ebi commented 3 years ago

Yes, we would expect *curators to add associations for large submissions, so I think this will need to be investigated. I think > ~1,000 associations in a submission is quite plausible.

*(only curators at this stage, not external submitters)

tudorgroza commented 3 years ago

@sprintell - below, for your consideration, is a possible implementation plan to handle the import of large submissions:

Curation App - Import process

- Retrieve studies paginated (instead of in bulk)
- Change DTOs to encapsulate all information related to a study (associations & samples) - i.e., create study-centric DTOs in Depo App
  - pre-assign associations to studies
  - pre-assign samples to studies
- On retrieve, dump data temporarily into a table instead of holding it in memory (to make it easier, we can just dump the individual study JSON DTOs into CLOB fields)
- Transform import process to be study-centric, rather than submission-centric
- Parallelise import process at study level, by introducing a processing queue with configurable capacity
  - Process study import in the order specified by the queue
  - During study import, process corresponding associations and samples
  - For associations, decide whether an artificial delay is necessary to avoid throttling the Ensembl API
  - Alternatively, move association validation / transformation into the Deposition App
  - Keep track of failed study imports and move on (to remove the all-or-nothing issue)
  - In time, we can introduce a "smarter" scheduler to enable smaller submissions to be imported even if the queue is at capacity
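
As a rough illustration of the plan above (paginated retrieval, a bounded processing queue, and failure tracking per study), here is a minimal Python sketch. It is not the actual Curation App code: `fetch_page`, `import_study`, the `"id"` key and the default page size, worker count and capacity are all hypothetical placeholders.

```python
import queue
import threading
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical study-centric DTO: a study with its associations and samples.
Study = Dict[str, object]


def iter_studies(fetch_page: Callable[[int, int], List[Study]],
                 page_size: int = 50) -> Iterator[Study]:
    """Yield studies page by page instead of loading the whole submission."""
    page = 0
    while True:
        batch = fetch_page(page, page_size)
        if not batch:
            return
        yield from batch
        page += 1


def import_submission(studies: Iterable[Study],
                      import_study: Callable[[Study], None],
                      workers: int = 4,
                      capacity: int = 8) -> Tuple[List[str], Dict[str, str]]:
    """Import studies via a bounded FIFO queue; record failures and move on."""
    work: "queue.Queue" = queue.Queue(maxsize=capacity)  # put() blocks when full
    imported: List[str] = []
    failed: Dict[str, str] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            study = work.get()
            if study is None:  # sentinel: no more studies
                return
            study_id = str(study["id"])
            try:
                import_study(study)
                with lock:
                    imported.append(study_id)
            except Exception as exc:  # one bad study must not abort the rest
                with lock:
                    failed[study_id] = repr(exc)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for study in studies:
        work.put(study)   # studies are picked up in queue order
    for _ in threads:
        work.put(None)    # one sentinel per worker
    for t in threads:
        t.join()
    return imported, failed
```

Because the queue is bounded, the producer only fetches the next page once workers have made room, so memory use stays roughly proportional to `capacity` plus one page. The `failed` map is what a later retry step or a report to the curator could be built on.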

Curation UI

- Publication page (listing all studies) needs to be paginated - this will, however, impact the bulk operations currently available on that page
- Bulk operations - including EFO assignment, curator assignment, publishing, etc need to be transformed to take a publication-centric approach (rather than using lists of studies)
- Individual (study-level) operations stay the same

eks-ebi commented 3 years ago

Search UI - potential consequences of supporting submissions with large number of studies

*Note: some of the comments below assume that some kind of sorting of associations takes place before displaying them in the Associations table. I'm not actually sure whether or how this happens - requires investigation.

- Publication page
  a. Associations table
     - paginated, only 5 studies displayed by default
     - however, potential problems if table attempts to evaluate all associations for 100s-1000s of studies in the publication before sorting by p-value and displaying the top 5
  b. Studies table
     - probably OK, as only 5 studies displayed by default, no default sorting(?)
- Study page
  a. Associations table
     - paginated, only 5 studies displayed by default
     - however, even with only 1 study displayed, could load slowly if a large number of associations are evaluated and sorted before displaying the top 5
- Trait page
  - the trait page already loads quite slowly, prob due to loading of complex EFO data (including child traits) - adding more data to the Catalog always has potential to make this problem worse
  a. Associations table
     - not directly affected by any specific submission being large, however the more associations included in the Catalog for a given trait, the more data required to be displayed on the corresponding Trait page
     - potential problems if all associations for a trait are evaluated and sorted before displaying the top 5
  b. Studies table
     - probably OK, as only 5 studies displayed by default, no default sorting(?)
  c. LocusZoom plot
     - not directly affected by any specific submission being large, however the more associations included in the Catalog for a given trait, the more data required to be displayed in this plot
- Variant page
  a. Associations table
     - not directly affected by any specific submission being large, however the more associations included in the Catalog for a given variant, the more data required to be displayed on the corresponding Variant page
     - potential problems if all associations for a variant are evaluated and sorted before displaying the top 5
  b. Studies table
     - probably OK, as only 5 studies displayed by default, no default sorting(?)
  c. Traits table
     - probably OK, as only 5 traits displayed by default - however, they seem to be evaluated before loading and sorted by association count by default, which could cause problems with large numbers of traits per variant
  d. LD plot
     - not directly affected by any specific submission being large, but could be affected by general increase in Catalog content, also relies on communication with Ensembl
- Gene page
  - already loads slowly, prob because it needs to load Gene mapping for multiple variants and communicate with Ensembl to load Gene info
  a. Associations / Studies / Traits tables
     - as for Variant above
- Region page (both Cytogenetic region e.g. "13q14.11" and custom bp range e.g. "6:16000000-25000000")
  - already loads slowly, prob because it needs to construct the region on the fly and communicate with Ensembl
  a. Associations / Studies / Traits tables
     - as for Variant above
- "Download Catalog Data" button (on all of the above pages)
  - creates custom download of all data for the current page
  - Study and Publication downloads could be affected by very large individual submissions (resulting spreadsheet could be very large if there are many studies and/or many associations per study)
  - Trait, Gene, Variant, Region downloads more generally affected by increasing amounts of data in Catalog
- Custom .csv downloads for each data table (button in top right of Associations, Studies etc. tables)
  - Study and Publication downloads could be affected by very large individual submissions (resulting spreadsheet could be very large if there are many studies and/or many associations per study)
  - Trait, Gene, Variant, Region downloads more generally affected by increasing amounts of data in Catalog
- Diagram
  - already very slow and cumbersome, so needs work anyway
  - not directly affected by any specific submission being large, but could be affected by general increase in Catalog content
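
On the recurring concern above about evaluating and sorting all associations just to show the top 5, here is a small Python sketch of one way to avoid a full sort. It is only an illustration, not how the Search UI works (as noted, that still needs investigating). The `Association` row shape is hypothetical.

```python
import heapq
from typing import Iterable, List, NamedTuple


class Association(NamedTuple):
    """Hypothetical, minimal row shape for the Associations table."""
    variant: str
    p_value: float


def top_associations(rows: Iterable[Association], k: int = 5) -> List[Association]:
    """Return the k rows with the smallest p-values, smallest first.

    Makes a single pass over `rows` and, for a streamed input, keeps only
    about k candidates at a time instead of sorting the full list.
    """
    return heapq.nsmallest(k, rows, key=lambda a: a.p_value)
```

This still has to look at every association once, so for very large publications it would probably be better to push the sort and limit down into the index or database query and only ever fetch the first page.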

jdhayhurst commented 3 years ago

@sprintell Is disk space going to be an issue here?

tudorgroza commented 3 years ago

@tudorgroza - Create a visual representation of the changes above. Include possibly next steps related to moving the validation into the deposition app.

tudorgroza commented 3 years ago

Diagrams located here: https://docs.google.com/presentation/d/1fe7inD5w0Lrm1NxYtuw5ZY1hAVklyFz1OkTn6ADl1I4/edit?usp=sharing

tudorgroza commented 3 years ago

Candidate for testing new import process: PMID 32128760 - Submission ID 5e9f08d1a5990c0001fea437

ljwh2 commented 3 years ago

@tudorgroza do you need a curator to test this?

tudorgroza commented 3 years ago

Eventually, yes :) I added that comment just for me not to forget about it :) Thank you.

EBISPOT / goci

Submissions with large numbers of studies need to be supported #200