EBISPOT / goci

GWAS Catalog Ontology and Curation Infrastructure
Apache License 2.0
26 stars 19 forks source link

Investigate and fix import process #111

Closed tudorgroza closed 3 years ago

tudorgroza commented 3 years ago

@ljwh2 @eks-ebi - the first "major" change I would like us to introduce while re-implementing the import is to transform it from a 'real-time' process to a 'push-and-forget'.

Currently, when you press the 'Import' button, you wait for the process to finish and to see the associated results. This works, IF, they're not too many studies linked to the submission. Examples such as the Sidore submission, on the other hand, will fail. A second issue is the lack of a proper audit trail associated with the import. Currently, if the import fails, you get a message in the UI - which disappears pretty quickly, and there's no way to get it back. This leaves the dev team in darkness w.r.t. the cause of the failure.

Changing to a 'push-and-forget' will:

Feed-back would be greatly appreciated.

eks-ebi commented 3 years ago

I don't see any problem with this plan as far as I know. Are there any disadvantages to the 'push-and-forget' model?

ljwh2 commented 3 years ago

This sounds fine to me @tudorgroza, please go ahead

sprintell commented 3 years ago

I already had a chat about this with Tudor as well, its going to be an excellent solution, the push-and-forget approach will be more stable with consistent outcomes for the varying file sizes and all use cases

tudorgroza commented 3 years ago

List of failures to be tested in the new import process:

Please feel free to add others.

ljwh2 commented 3 years ago

Can we have a bit more detail here @tudorgroza ?

When you say "no rsID, no risk allele" - does this mean no text in the field? Sometimes curators will not have risk allele info so will need to enter "?". And not all variants have rsIDs as identifier.

Missing location/gene - this is not entered in the template, so nothing to import? This info comes from the mapping pipeline - are errors in the mapping pipeline causing the import to fail?

tudorgroza commented 3 years ago

Finding answers to your questions would be the purpose of creating these tests. So far, based on my current understanding of the code, most items above will not fail the import, but will be flagged. Others - coming from the mapping pipeline - (more concretely pretty much all exceptions raised by the mapping pipeline) - will cause the process to fail.

We just need tests that are written to intentionally fail - for us to make sure the process does indeed fail when it's expected to fail.

earlEBI commented 3 years ago

uploaded template Sorgini test template.xlsx to PMID 31225122. On clicking import, import failed - http://snoopy.ebi.ac.uk:9680/gwas/curation/submissions/5f466b5a62694a0001d4f3f0: === GENERAL OUTCOME: FAIL 1[Success] :: [Verifying studies | 5f466b5a62694a0001d4f3f0] 2[Fail] :: [Validating associations | 5f466b5a62694a0001d4f3f0] 3[Success] :: [Updating submission status: IMPORT_FAILED | 5f466b5a62694a0001d4f3f0] Errors: Validating associations | [GCST90000181 | testing1] Full validation error: OR and OR reciprocal [data]: OR value is less than 1 and no OR reciprocal value entered | false Validating associations | [GCST90000181 | testing1] Full validation error: Beta [data]: Value is not empty | false Validating associations | [GCST90000181 | testing1] Full validation error: Beta Unit [data]: Value is not empty | false Validating associations | [GCST90000181 | testing1] Full validation error: Beta Direction [data]: Value is not empty | false Validating associations | [GCST90000181 | testing1] Full validation error: OR reciprocal range [data]: Value is empty | true Validating associations | [GCST90000181 | testing1] Full validation error: P-value mantissa [data]: Value not valid i.e. greater than 9 | false Validating associations | [GCST90000181 | testing1] Full validation error: P-value mantissa [data]: Value is empty | false Validating associations | [GCST90000181 | testing1] Full validation error: P-value exponent [data]: Value is empty | false Validating associations | [GCST90000181 | testing1] Full validation error: OR reciprocal range [data]: Value is empty | true

tudorgroza commented 3 years ago

@earlEBI - thank you for this! I need to make the output prettier :) Is the outcome expected or wrong?

earlEBI commented 3 years ago

I can't see anything wrong with the template, myself! From what i understand of the errors: - "OR and OR reciprocal [data]: OR value is less than 1 and no OR reciprocal value entered" - the OR I entered is not less than 1... and there is no OR reciprocal column in the new template. - beta [data]: "value is not empty" - what does that mean? - beta unit [data]: "value is not empty" - why is that a problem? - beta direction [data]: "value is not empty" - there is no beta direction column in the template anymore. - "P-value mantissa [data]: Value not valid i.e. greater than 9" - I thought it might automatically round the mantissa and exponent down, but that's okay if that's not possible. Although ideally it would fail in the template. - "P-value exponent [data]: Value is empty" - I think this is because I entered a p-value of 1e-1000 to see if it could handle it. Doesn't seem it can. - "OR reciprocal range [data]: Value is empty" - again, not relevant

Hopefully all that's readable...

earlEBI commented 3 years ago

Ah, I see on this page http://snoopy.ebi.ac.uk:9680/gwas/curation/submissions/5f466b5a62694a0001d4f3f0 under the Associations header, for some reason the OR and CI lower for the first association have been filled in with 0, so that's what some of the errors refer to. But those fields were empty in the upload template.

Screenshot 2020-08-26 at 16 21 49
eks-ebi commented 3 years ago

My Test Submission 1 (Auckland): http://193.62.54.159/gwas/deposition/submission/5f55fd0f10d43700010d95d3 http://snoopy:9680/gwas/curation/submissions/5f55fd0f10d43700010d95d3

Features

  1. Incorrect rsID (starts with "rs..." but is not in Ensembl)
  2. Incorrect variant ID (does not start with "rs...")

Result:

The submission preview is visible, but the Import button is inactive, i.e. unable to import this submission.

Error messages:

=== GENERAL OUTCOME: FAIL 1[Success]

:: [Verifying studies | 5f55fd0f10d43700010d95d3] 2[Fail]

:: [Validating associations | 5f55fd0f10d43700010d95d3] 3[Success]

This seems to say that the association validation was successful - but that doesn't seem to make sense because there are errors with the associations lists below

:: [Updating submission status: IMPORT_FAILED | 5f55fd0f10d43700010d95d3]

** Errors: **

Validating associations | [GCST90000188 | study1] Full validation error: OR reciprocal range [data]: Value is empty | true

Validating associations | [GCST90000188 | study1] Full validation error: OR reciprocal range [data]: Value is empty | true

We see this error a lot when adding data directly in the curation interface. There, it is only a warning and does not prevent us from publishing the study. I don't know if this error cause the import to fail? It probably shouldn't. In fact, it is not a meaningful error and ideally would be removed, because: 1) There is no "OR reciprocal range" column in the deposition template, so it would be impossible to enter a value for this field using this method. 2) If you enter a number >1 in the "OR" column, then there is no need for an "OR reciprocal range", so it should not be entered.

Validating associations | [GCST90000188 | study1] Full validation error: OR and OR reciprocal [data]: OR value is less than 1 and no OR reciprocal value entered | false

Another hold-over from adding data directly in the curation interface. There, we always have to enter an OR >1. If the OR reported in the paper is <1, then we instead enter it in the "OR reciprocal" field and there is a button to convert that to the OR (by doing a 1/x operation). But in the template there is only an "OR" field, which accepts numbers > or <1. I think the plan for the future was to store OR <1 as is from now on, without doing any inversion, so this warning should not be necessary. @earlEBI @ljwh2 is that right?

Validating associations | [GCST90000188 | study1] Full validation error: OR reciprocal range [data]: Value is empty | true

Validating associations | [GCST90000189 | study2] :: rs000 not found for homo_sapiens

Validating associations | [GCST90000189 | study2] :: Attempt to map SNP: rs000 returned no location details

Validating associations | [GCST90000189 | study2] :: Attempt to map SNP: rs000 returned no mapped genes

Validating associations | [GCST90000189 | study2] :: variant123 not found for homo_sapiens

Validating associations | [GCST90000189 | study2] :: Attempt to map SNP: variant123 returned no location details

Validating associations | [GCST90000189 | study2] :: Attempt to map SNP: variant123 returned no mapped genes

The rs000 and variant123 errors are all expected, and I think they should prevent importing. But I don't understand why these only appear for [GCST90000189 | study2]: I entered the same false rsID for [GCST90000188 | study1], so I would expect to also see the same errors for that study.

eks-ebi commented 3 years ago

@tudorgroza I looked into testing the following errors:

But these are all picked up in the deposition app, so therefore don't make it to the import stage. I guess there's a chance they could slip through if there's a problem with the depo validation, but not sure how to test these in curation.

the relevant errors in depo:

Associations: Column 'Effect allele' in row(s) [7] lacks mandatory value
Associations: Column 'Variant ID' in row(s) [6] lacks mandatory value
Associations: Column 'Effect allele' in row(s) [8-9] contains incorrect value. Accepted values should follow the pattern: '^[ACTG?-]*$'
Associations: Column 'Other allele' in row(s) [7] contains incorrect value. Accepted values should follow the pattern: '^[ACTG?-]*$'
eks-ebi commented 3 years ago

My Test Submission 2 (Avet-Loiseau) http://193.62.54.159/gwas/deposition/submission/5f56110010d43700010fafaa http://snoopy:9680/gwas/curation/submissions/5f56110010d43700010fafaa?

Features:

Results: No errors on the import preview screen and the Import button is active. However, when I clicked Import, the submission did not get imported (only the previously existing study can be seen in Curation Home: http://snoopy:9680/gwas/curation/studies?page=1&author=Avet-Loiseau%20H), and the following errors appeared on the preview page:

=== GENERAL OUTCOME: FAIL 1[Success]

:: [Verifying studies | 5f56110010d43700010fafaa] 2[Fail]

:: [Validating associations | 5f56110010d43700010fafaa] 3[Success]

:: [Updating submission status: IMPORT_FAILED | 5f56110010d43700010fafaa]

** Errors: **

Validating associations | [GCST90000190 | avet_study] Full validation error: Risk element (allele, haplotype or SNPxSNP interaction) frequency in controls [data]: Value is invalid, value is not between 0 and 1 | true

This is an expected error. The study should not be imported if the risk element is invalid (i.e. not between 0 and 1).

Validating associations | [GCST90000190 | avet_study] Full validation error: P-value mantissa [data]: Value not valid i.e. greater than 9 | false

This is an expected error. The study should not be imported if the p-value rounds to 1E-5 or greater, based on the current validation requirements in the curation interface.

There was no errors about the Other allele being missing - I think that is correct, as curators will not always be able to extract an Other allele.

There was no error about the OR/Beta being absent - that is the correct behaviour.

ljwh2 commented 3 years ago

In response to above comments: I think the plan for the future was to store OR <1 as is from now on, without doing any inversion, so this warning should not be necessary. @earlEBI @ljwh2 is that right? yes, that's correct. The rs000 and variant123 errors are all expected, and I think they should prevent importing. I don't agree, we may need to enter rsIDs not in Ensembl (if that is what authors have reported). Also may need to enter non-standard variant IDs. Could be worth considering what the impact of the Ensembl pipeline failing would be (which it does from time to time). Would that prevent us importing anything?

tudorgroza commented 3 years ago

Thanks so much @eks-ebi ! I'll go through these and see what's going on.

tudorgroza commented 3 years ago

@ljwh2

Could be worth considering what the impact of the Ensembl pipeline failing would be (which it does from time to time). Would that prevent us importing anything?

Yes - that should stop the import process and report the error.

eks-ebi commented 3 years ago

My Test Submission 3 (Bae) http://193.62.54.159/gwas/deposition/submission/5f56305f10d437000112fc51 http://snoopy:9680/gwas/curation/submissions/5f56305f10d437000112fc51?

Features:

Result:

=== GENERAL OUTCOME: SUCCESS 1[Success] :: [Verifying studies | 5f56305f10d437000112fc51] 2[Success] :: [Validating associations | 5f56305f10d437000112fc51] 3[Success] :: [Deleting proxy studies | 5f56305f10d437000112fc51] 4[Success] :: [Creating study [GCST90000191] | 5f56305f10d437000112fc51] 5[Success] :: [Creating association | GCST90000191] 6[Success] :: [Creating sample | GCST90000191] 7[Success] :: [Creating study [GCST90000192] | 5f56305f10d437000112fc51] 8[Success] :: [Creating association | GCST90000192] 9[Success] :: [Creating sample | GCST90000192] 10[Success] :: [Creating study [GCST90000193] | 5f56305f10d437000112fc51] 11[Success] :: [Creating association | GCST90000193] 12[Success] :: [Creating sample | GCST90000193] 13[Success] :: [Creating study [GCST90000194] | 5f56305f10d437000112fc51] 14[Success] :: [Creating association | GCST90000194] 15[Success] :: [Creating sample | GCST90000194] 16[Success] :: [Creating study [GCST90000195] | 5f56305f10d437000112fc51] 17[Success] :: [Creating association | GCST90000195] 18[Success] :: [Creating sample | GCST90000195] 19[Success] :: [Creating study [GCST90000196] | 5f56305f10d437000112fc51] 20[Success] :: [Creating association | GCST90000196] 21[Success] :: [Creating sample | GCST90000196] 22[Success] :: [Creating study [GCST90000197] | 5f56305f10d437000112fc51] 23[Success] :: [Creating association | GCST90000197] 24[Success] :: [Creating sample | GCST90000197] 25[Success] :: [Creating study [GCST90000198] | 5f56305f10d437000112fc51] 26[Success] :: [Creating association | GCST90000198] 27[Success] :: [Creating sample | GCST90000198] 28[Success] :: [Deleting unpublished data | 5f56305f10d437000112fc51] 29[Success] :: [Updating submission status: CURATION_COMPLETE | 5f56305f10d437000112fc51]
tudorgroza commented 3 years ago

@eks-ebi @ljwh2 - We will discuss more during the call we will hopefully have in the next coming days, however, after going through all of Elliot's feedback it seems that the new import process works as expected - with the major difference being the way I've dealt with errors emerging from the existing association validation process (the current implementation basically just makes note of them, while I considered them proper errors).

Let's discuss the semantics of these errors so that I can adapt the process accordingly.

PS: @eks-ebi - w.r.t. Sporadic neuroblastoma - it is indeed a typo ... but in the DB, not in your template. The DB entry has an extra space at the end :)

eks-ebi commented 3 years ago

By the way, I just updated the Sporadic neuroblastoma entry in the curation interface, so it should no longer contain the extra space at the end.

eks-ebi commented 3 years ago

Yesterday morning, we discussed that Tudor has made changes to the display of warnings for imported submissions. Here, I am re-testing a previous submission that should pass import:

Same as the Bae submission above, this time submitted under a different publication (Bani-Fatemi 32445792): http://193.62.54.159/gwas/deposition/submission/5f60aae010d4370001b74e89 http://snoopy:9680/gwas/curation/submissions/5f60aae010d4370001b74e89

This submission is not expected to have any errors/warnings.

Result:

Import messages are nicely formatted:

=== GENERAL OUTCOME: SUCCESS
1[Success] :: [Verifying studies | 5f60aae010d4370001b74e89]
2[Success] :: [Validating associations | 5f60aae010d4370001b74e89]
3[Success] :: [Deleting proxy studies | 5f60aae010d4370001b74e89]
4[Success] :: [Creating study [GCST90000211] | 5f60aae010d4370001b74e89]
5[Success] :: [Creating association | GCST90000211]
6[Success] :: [Creating sample | GCST90000211]
7[Success] :: [Creating study [GCST90000212] | 5f60aae010d4370001b74e89]
8[Success] :: [Creating association | GCST90000212]
9[Success] :: [Creating sample | GCST90000212]
10[Success] :: [Creating study [GCST90000213] | 5f60aae010d4370001b74e89]
11[Success] :: [Creating association | GCST90000213]
12[Success] :: [Creating sample | GCST90000213]
13[Success] :: [Creating study [GCST90000214] | 5f60aae010d4370001b74e89]
14[Success] :: [Creating association | GCST90000214]
15[Success] :: [Creating sample | GCST90000214]
16[Success] :: [Creating study [GCST90000215] | 5f60aae010d4370001b74e89]
17[Success] :: [Creating association | GCST90000215]
18[Success] :: [Creating sample | GCST90000215]
19[Success] :: [Creating study [GCST90000216] | 5f60aae010d4370001b74e89]
20[Success] :: [Creating association | GCST90000216]
21[Success] :: [Creating sample | GCST90000216]
22[Success] :: [Creating study [GCST90000217] | 5f60aae010d4370001b74e89]
23[Success] :: [Creating association | GCST90000217]
24[Success] :: [Creating sample | GCST90000217]
25[Success] :: [Creating study [GCST90000218] | 5f60aae010d4370001b74e89]
26[Success] :: [Creating association | GCST90000218]
27[Success] :: [Creating sample | GCST90000218]
28[Success] :: [Deleting unpublished data | 5f60aae010d4370001b74e89]
29[Success] :: [Updating submission status: CURATION_COMPLETE | 5f60aae010d4370001b74e89]
eks-ebi commented 3 years ago

Another repeated submission, this time where some warnings are expected:

Template identical to the Avet-Loiseau submission above, this time submitted under a different publication (Bankvall 32558109): http://193.62.54.159/gwas/deposition/submission/5f60adb010d4370001b7d6dc http://snoopy:9680/gwas/curation/submissions/5f60adb010d4370001b7d6dc

Results:

Import messages:

=== GENERAL OUTCOME: FAIL
1[Success] :: [Verifying studies | 5f60adb010d4370001b7d6dc]
2[Success] :: [Validating associations | 5f60adb010d4370001b7d6dc]
3[Success] :: [Deleting proxy studies | 5f60adb010d4370001b7d6dc]
4[Success] :: [Creating study [GCST90000219] | 5f60adb010d4370001b7d6dc]
5[Success] :: [Creating association | GCST90000219]
6[Success] :: [Creating association | GCST90000219]
7[Fail] :: [Creating association | GCST90000219]
8[Fail] :: [Creating association | GCST90000219]
9[Success] :: [Creating sample | GCST90000219]
10[Success] :: [Deleting unpublished data | 5f60adb010d4370001b7d6dc]
11[Success] :: [Updating submission status: CURATION_COMPLETE | 5f60adb010d4370001b7d6dc]
** Errors: **
Creating association | Save error: true | Value is invalid, value is not between 0 and 1 | Risk element (allele, haplotype or SNPxSNP interaction) frequency in controls
Creating association | Save error: false | Value not valid i.e. greater than 9 | P-value mantissa
** Warnings: **
Validating associations | [GCST90000219 | avet_study] Full validation error: Risk element (allele, haplotype or SNPxSNP interaction) frequency in controls [data]: Value is invalid, value is not between 0 and 1 | true
Validating associations | [GCST90000219 | avet_study] Full validation error: P-value mantissa [data]: Value not valid i.e. greater than 9 | false
tudorgroza commented 3 years ago

@eks-ebi - the error / warning pairs is most likely a bug. I'll look into fixing it. Thank you.