jdhayhurst closed 3 years ago
The actual validation of the submission has been successfully tested as per https://github.com/EBISPOT/goci/issues/275. The handover to the sumstats service is now async - so that shouldn't be an issue. I'll start looking into the import part.
@ljwh2 @eks-ebi - quick question on the import into the Curation App (which might also be connected to https://github.com/EBISPOT/goci/issues/102): are we expecting curators/users to add associations for these large submissions? In particular ... are we expecting a large number of associations (> ~1,000) to be ingested at any point in time with any particular submission?
Reason for asking: importing associations relies on one of the Ensembl APIs. This leads to two challenges:
Long story short: if we're expecting a very large number of associations, we need to rewrite the import process from scratch (practically turning it inside out). Alternatively, we can look into moving some of the steps currently done during import into the Deposition App, so that they are performed during validation.
Yes, we would expect *curators to add associations for large submissions, so I think this will need to be investigated. I think > ~1,000 associations in a submission is quite plausible.
*(only curators at this stage, not external submitters)
@sprintell - below, for your consideration, is a possible implementation plan to handle the import of large submissions:
Curation App - Import process
Curation UI
- Publication page (listing all studies) needs to be paginated; this will, however, impact the bulk operations currently available on that page
- Bulk operations (including EFO assignment, curator assignment, publishing, etc.) need to be transformed to take a publication-centric approach (rather than using lists of studies)
- Individual (study-level) operations stay the same
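To illustrate the difference between page-level listing and a publication-centric bulk operation, here is a minimal Python sketch. Everything in it is hypothetical: `Study`, `StudyStore`, the page size of 25 and the `"PMID:example"` identifier are made up for illustration and do not reflect the actual Curation App code or schema. The point is only that the listing returns one page at a time, while the bulk operation is keyed on the publication and so reaches every study, not just the ones currently on screen.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Study:
    accession: str
    publication_id: str
    curator: str = "unassigned"


class StudyStore:
    """Hypothetical in-memory stand-in for the curation database."""

    def __init__(self, studies: List[Study]) -> None:
        self._by_publication: Dict[str, List[Study]] = {}
        for study in studies:
            self._by_publication.setdefault(study.publication_id, []).append(study)

    def page(self, publication_id: str, page: int, size: int = 25) -> List[Study]:
        """Return one page of studies for the listing (page is 0-based)."""
        studies = self._by_publication.get(publication_id, [])
        start = page * size
        return studies[start:start + size]

    def assign_curator(self, publication_id: str, curator: str) -> int:
        """Publication-centric bulk operation.

        Applies to every study of the publication, regardless of which
        page is displayed. Returns the number of studies updated.
        """
        studies = self._by_publication.get(publication_id, [])
        for study in studies:
            study.curator = curator
        return len(studies)


# Assumed example data: 60 studies under one made-up publication.
store = StudyStore([Study(f"GCST{i:06d}", "PMID:example") for i in range(60)])
first_page = store.page("PMID:example", page=0)
last_page = store.page("PMID:example", page=2)
updated = store.assign_curator("PMID:example", "curator_a")
```

In a real database-backed implementation the bulk operation would more likely be a single update filtered on the publication, rather than a loop over loaded objects.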
Search UI - potential consequences of supporting submissions with large number of studies
*Note: some of the comments below assume that some kind of sorting of associations takes place before displaying them in the Associations table. I'm not actually sure whether or how this happens - requires investigation.
- Publication page
  - a. Associations table
    - paginated, only 5 studies displayed by default
    - however, potential problems if the table attempts to evaluate all associations for 100s-1000s of studies in the publication before sorting by p-value and displaying the top 5
  - b. Studies table
    - probably OK, as only 5 studies displayed by default, no default sorting(?)
- Study page
  - a. Associations table
    - paginated, only 5 studies displayed by default
    - however, even with only 1 study displayed, could load slowly if a large number of associations are evaluated and sorted before displaying the top 5
- Trait page
  - the trait page already loads quite slowly, probably due to loading of complex EFO data (including child traits); adding more data to the Catalog always has the potential to make this problem worse
  - a. Associations table
    - not directly affected by any specific submission being large, however the more associations included in the Catalog for a given trait, the more data required to be displayed on the corresponding Trait page
    - potential problems if all associations for a trait are evaluated and sorted before displaying the top 5
  - b. Studies table
    - probably OK, as only 5 studies displayed by default, no default sorting(?)
  - c. LocusZoom plot
    - not directly affected by any specific submission being large, however the more associations included in the Catalog for a given trait, the more data required to be displayed in this plot
- Variant page
  - a. Associations table
    - not directly affected by any specific submission being large, however the more associations included in the Catalog for a given variant, the more data required to be displayed on the corresponding Variant page
    - potential problems if all associations for a variant are evaluated and sorted before displaying the top 5
  - b. Studies table
    - probably OK, as only 5 studies displayed by default, no default sorting(?)
  - c. Traits table
    - probably OK, as only 5 traits displayed by default; however, they seem to be evaluated before loading and sorted by association count by default, which could cause problems with large numbers of traits per variant
  - d. LD plot
    - not directly affected by any specific submission being large, but could be affected by general increase in Catalog content; also relies on communication with Ensembl
- Gene page
  - already loads slowly, probably because it needs to load Gene mapping for multiple variants and communicate with Ensembl to load Gene info
  - a. Associations / Studies / Traits tables
    - as for Variant above
- Region page (both Cytogenetic region e.g. "13q14.11" and custom bp range e.g. "6:16000000-25000000")
  - already loads slowly, probably because it needs to construct the region on the fly and communicate with Ensembl
  - a. Associations / Studies / Traits tables
    - as for Variant above
- "Download Catalog Data" button (on all of the above pages)
  - creates custom download of all data for the current page
  - Study and Publication downloads could be affected by very large individual submissions (resulting spreadsheet could be very large if there are many studies and/or many associations per study)
  - Trait, Gene, Variant, Region downloads more generally affected by increasing amounts of data in Catalog
- Custom .csv downloads for each data table (button in top right of Associations, Studies etc. tables)
  - Study and Publication downloads could be affected by very large individual submissions (resulting spreadsheet could be very large if there are many studies and/or many associations per study)
  - Trait, Gene, Variant, Region downloads more generally affected by increasing amounts of data in Catalog
- Diagram
  - already very slow and cumbersome, so needs work anyway
  - not directly affected by any specific submission being large, but could be affected by general increase in Catalog content
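Several of the concerns above come down to "evaluate and sort everything, then show the top 5". As a rough illustration of why that need not be expensive, the Python sketch below selects the 5 smallest p-values without fully sorting the input. It is only a sketch under assumptions: the `(variant id, p-value)` tuples and the `top_associations` helper are made up, and, as noted above, it has not been confirmed whether or how the Search UI sorts associations at all.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical row shape: (variant id, p-value).
Association = Tuple[str, float]


def top_associations(associations: Iterable[Association], k: int = 5) -> List[Association]:
    """Return the k associations with the smallest p-values, ascending.

    heapq.nsmallest keeps only k candidates in memory while scanning,
    so the input can be a generator or cursor rather than a full list.
    """
    return heapq.nsmallest(k, associations, key=lambda a: a[1])


# Made-up example data.
rows = [
    ("rs1", 3e-8), ("rs2", 1e-12), ("rs3", 5e-9), ("rs4", 2e-20),
    ("rs5", 4e-8), ("rs6", 1e-10), ("rs7", 9e-9),
]
top5 = top_associations(rows)
```

If the data is served from a database or search index, the equivalent would normally be an ordered, limited query on an indexed p-value field, so that only the first page of rows ever leaves the store.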
@sprintell Is disk space going to be an issue here?
@tudorgroza - Create a visual representation of the changes above. Possibly include next steps related to moving the validation into the Deposition App.
Candidate for testing new import process: PMID 32128760
- Submission ID 5e9f08d1a5990c0001fea437
@tudorgroza do you need a curator to test this?
Eventually, yes :) I added that comment just for me not to forget about it :) Thank you.
Currently we can only support submissions that contain up to a few hundred studies.
Once identified, come up with some solutions/tickets to address these bottlenecks.