Implement import of trait names from ClinVar

tskir commented 4 years ago

Information about traits can be retrieved from a static, periodically updated endpoint https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz. There is an accompanying checksum file https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz.md5.
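A minimal sketch of how the downloaded archive could be checked against the checksum file. It assumes (not confirmed in this thread) that the .md5 file holds the hex digest as its first whitespace-separated token, as in the usual md5sum output; the function name md5_matches is hypothetical.

```python
import hashlib


def md5_matches(payload: bytes, md5_file_text: str) -> bool:
    """Return True if the MD5 digest of payload equals the digest in the .md5 file.

    Assumption: the digest is the first whitespace-separated token of the file,
    optionally followed by a file name (md5sum-style layout).
    """
    tokens = md5_file_text.split()
    if not tokens:
        # An empty checksum file cannot confirm anything.
        return False
    expected = tokens[0].lower()
    actual = hashlib.md5(payload).hexdigest()
    return actual == expected
```

The import would call this with the raw bytes of variant_summary.txt.gz and the text of the .md5 file, and abort if it returns False.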

The main file is a gzipped TSV containing about two dozen columns (first line, starting with #, is a header). There are three columns of interest:

PhenotypeList column contains trait names. There can be one or multiple per record. If there are multiple, they are separated with semicolons, for example Inborn genetic diseases;Mitochondrial complex 1 deficiency, nuclear type 21;Mitochondrial complex I deficiency;not provided
RCVaccession contains IDs of ClinVar RCV records. An RCV record, in general, associates multiple genetic variants with multiple trait names. Similarly, there can be one or multiple, for example RCV000622708;RCV000735415;RCV000000017;RCV000196589.
AlleleID contains the identifier of a genetic variant. Example: 15046.

These columns follow this list of constraints:

The number of semicolon-separated values in PhenotypeList and RCVaccession is always the same for a given record and is always 1 or greater. The values with the same index form pairs.
AlleleID always contains an integer; however, it is not a key for the table, and several rows can have the same AlleleID (see below).

Trait names imported into the database should represent a unique set of all trait names present in this file.

To calculate the number of records assigned to each trait, we need to consider all tuples of (AlleleID, RCVaccession, PhenotypeList). For example, exploding along RCVaccession and PhenotypeList, the sample record described above would generate four tuples:

(15046, RCV000622708, Inborn genetic diseases)
(15046, RCV000735415, Mitochondrial complex 1 deficiency, nuclear type 21)
(15046, RCV000000017, Mitochondrial complex I deficiency)
(15046, RCV000196589, not provided)
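The explosion of a single record can be sketched as follows. This is only an illustration under the constraints listed above (equal number of semicolon-separated values, paired by index); the function name explode_row is hypothetical.

```python
def explode_row(allele_id: str, rcv_accession: str, phenotype_list: str):
    """Split one row into (AlleleID, RCV accession, trait name) tuples.

    Values of RCVaccession and PhenotypeList are paired by index, so both
    columns must contain the same number of semicolon-separated values.
    """
    rcvs = rcv_accession.split(";")
    traits = phenotype_list.split(";")
    if len(rcvs) != len(traits):
        raise ValueError(
            f"AlleleID {allele_id}: {len(rcvs)} RCV accessions "
            f"but {len(traits)} trait names"
        )
    return [(int(allele_id), rcv, trait) for rcv, trait in zip(rcvs, traits)]
```

Note that a trait name may itself contain a comma (as in "Mitochondrial complex 1 deficiency, nuclear type 21"), so only the semicolon is treated as a separator.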

When we do this for all rows in the file, we need to combine all tuples together and deduplicate them (leave only unique tuples). In this final set, the number of tuples which mention a given trait name is the number of records associated with this trait, which should also be imported into the database and stored in a separate field.
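A short sketch of the deduplicate-then-count step, assuming the tuples from all rows have already been collected into one iterable. The function name count_records_per_trait is hypothetical.

```python
from collections import Counter


def count_records_per_trait(tuples):
    """Count unique (AlleleID, RCV, trait) tuples per trait name.

    tuples: iterable of (allele_id, rcv_accession, trait_name) from all rows.
    Duplicates (for example the same record listed for both GRCh37 and
    GRCh38) are collapsed by the set before counting.
    """
    unique_tuples = set(tuples)
    return Counter(trait for _allele_id, _rcv, trait in unique_tuples)
```

The keys of the returned counter are also the unique set of trait names to import, and the values are the record counts to store alongside them.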

Non-unique tuples arise because of different reference genomes. Sometimes there will be two records with the same AlleleID and with most columns containing identical or very similar values; the only difference will be in the reference genome (GRCh37/GRCh38) and the chromosomal coordinates. This is why deduplication is needed to count the number of records correctly.

Some considerations about the import:

Columns in the file can change order, so they should be extracted based on their names, not fixed position indexes.
Import should be performed periodically and automatically, and there should also be a way to trigger it manually (perhaps through a button or a separate page?)
Failure to import must generate a log message which is stored somewhere accessible (we'll need to think through the logging mechanism, perhaps in a separate issue)
Care should be taken so that modifications made by users don't conflict with database modifications caused by the import.
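Name-based column extraction could look roughly like this. The sketch assumes the whole gzipped file is already in memory as bytes and that the first line is the header prefixed with #; the function name read_columns and the WANTED constant are hypothetical.

```python
import csv
import gzip
import io

# The three columns of interest, looked up by name rather than by position.
WANTED = ("AlleleID", "RCVaccession", "PhenotypeList")


def read_columns(gz_bytes: bytes):
    """Yield (AlleleID, RCVaccession, PhenotypeList) strings for every data row."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8", newline="") as handle:
        # The header is the first line and starts with '#'.
        header = handle.readline().rstrip("\r\n").lstrip("#").split("\t")
        missing = [name for name in WANTED if name not in header]
        if missing:
            raise ValueError(f"Missing expected columns: {missing}")
        reader = csv.DictReader(
            handle, fieldnames=header, delimiter="\t", quoting=csv.QUOTE_NONE
        )
        for row in reader:
            yield tuple(row[name] for name in WANTED)
```

Because values are looked up through the header, reordering or adding columns in the source file does not affect the result; a renamed or removed column fails loudly instead of silently reading the wrong data.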

Action checklist

[ ] Implement check for a maximum trait name length, as discussed here https://github.com/EBIvariation/trait-curation/pull/35#discussion_r440160449

tskir commented 4 years ago

To clarify the scope here: as we discussed yesterday, this issue is specifically about backend logic to do the import. Import automation & manual triggering will be implemented separately.

joj0s commented 4 years ago

So just to clear up the process of selecting trait names and calculating the number of source records:

Inserting trait names: I select every single unique trait name that appears in the PhenotypeList column
Calculating source record number: For each AlleleID, I calculate every unique possible tuple with RCVaccession and PhenotypeList. The number of each trait name's appearance in those is its source record number.

Also, we need to see what the behavior will be for already existing trait names whenever a new import cycle begins. Do we calculate the source records again for already existing trait names, and just insert the new ones as usual?

joj0s commented 4 years ago

Another thing is that I am excluding "not provided" values for both trait name imports and source record number calculation. Should I treat those records differently?

tskir commented 4 years ago

> Inserting trait names: I select every single unique trait name that appears in the PhenotypeList column

That's correct, provided that the PhenotypeList column values are exploded prior to that, i.e. split by the ; character.

> Calculating source record number: For each AlleleID, I calculate every unique possible tuple with RCVaccession and PhenotypeList. The number of each trait name's appearance in those is its source record number.

That's correct. Just to clarify, after calculating tuples for each AlleleID, it's also important to combine all of them together and then do the deduplication (remember, AlleleID is not unique per row).

To expand a bit on why we do this: the central object which is in the end submitted to Open Targets is an "evidence string". Each evidence string is defined by a tuple (trait, variant, ClinVar record). So the number of evidence strings generated by any given trait will depend on the total number of (variant, ClinVar record) tuples associated with it. This corresponds to (AlleleID, RCV) tuples in the source data.

> Another thing is that I am excluding "not provided" values for both trait name imports and source record number calculation. Should I treat those records differently?

Yeah, that's a good question. ClinVar has a number of trait "names" which cannot be mapped to any ontology term. Most notably "not provided", but there are also things like "see cases", "other" and a couple of others. In this ticket you don't need to address those situations in any special way. In the future, we will need a way for curators to mark a trait as "invalid", probably necessitating an additional status. This is not a high priority issue. I've added it to the backlog: https://github.com/EBIvariation/trait-curation/issues/37

joj0s commented 4 years ago

> Also, we need to see what the behavior will be for already existing trait names whenever a new import cycle begins. Do we calculate the source records again for already existing trait names, and just insert the new ones as usual?

Regarding this, what I am doing right now is that if a record already exists in the database I leave it as is, and I only insert new ones with their source record numbers. Let me know if I should change this.

Other than that, the script is ready. Should I add a button somewhere to trigger the import and submit it in a PR?

tskir commented 4 years ago

> Also, we need to see what the behavior will be for already existing trait names whenever a new import cycle begins. Do we calculate the source records again for already existing trait names, and just insert the new ones as usual?

> Regarding this, what I am doing right now is that if a record already exists in the database I leave it as is, and I only insert new ones with their source record numbers. Let me know if I should change this.

Oh, I missed that question, sorry. The correct behaviour in this case is what you described originally: insert new trait names with their record counts + also recalculate number of linked records for all existing traits.
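The intended update behaviour can be sketched with plain dictionaries standing in for the database table (the real implementation would go through the application's models). One assumption not settled in this thread: a trait that already exists but no longer appears in the file is kept with a recalculated count of 0. The function name merge_counts is hypothetical.

```python
def merge_counts(existing: dict, fresh: dict) -> dict:
    """Combine stored trait counts with counts from the current import.

    existing: trait name -> record count currently stored.
    fresh:    trait name -> record count calculated from the new file.
    """
    # Recalculate every existing trait; assumption: absent from the file means 0.
    merged = {name: fresh.get(name, 0) for name in existing}
    # Insert new traits and overwrite the counts of traits seen in this import.
    merged.update(fresh)
    return merged
```

For example, merging stored counts {"A": 1, "B": 5} with fresh counts {"A": 3, "C": 2} updates A to 3, resets B to 0, and inserts C with 2.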

> Other than that, the script is ready. Should I add a button somewhere to trigger the import and submit it in a PR?

Yes, please do. I'm not sure where to put this button, though. There is an issue https://github.com/EBIvariation/trait-curation/issues/32 for designing the page for buttons triggering various processes, but it's a separate one. So for now, I guess you could just add the button anywhere where it is convenient. And then, once #32 and subsequent implementation issues are done, we'll move it there.

Another issue is logging. In case the import doesn't go through for some reason, we need a way to access the logs of the script. Do you have any ideas on how to do that best? For example, regarding the Heroku instances where all of this is running, do you have any way to access them (semi-)directly? Or could we make the script, once triggered, output its logs to the JavaScript console or something?

joj0s commented 4 years ago

> Another issue is logging. In case the import doesn't go through for some reason, we need a way to access the logs of the script. Do you have any ideas on how to do that best? For example, regarding the Heroku instances where all of this is running, do you have any way to access them (semi-)directly? Or could we make the script, once triggered, output its logs to the JavaScript console or something?

The easiest way to view logs would be to output them to the server console. They can then be accessed via the Heroku CLI or simply by going to the 'logs' page in the Heroku webpage. For example https://dashboard.heroku.com/apps/clinvar-trai-prototype-pmia54t/logs

tskir commented 4 years ago

> The easiest way to view logs would be to output them to the server console. They can then be accessed via the Heroku CLI or simply by going to the 'logs' page in the Heroku webpage. For example https://dashboard.heroku.com/apps/clinvar-trai-prototype-pmia54t/logs

OK, that's great, let's use this way for now. Maybe in the future we'll implement some user-friendly logging, a status page, email notifications, or something similar. For now Heroku server logs will do just fine.
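A minimal sketch of console logging for the import, using only the Python standard library. Writing to standard output is enough for the messages to show up in the server console and therefore in the Heroku logs. The logger name clinvar_import and both function names are hypothetical.

```python
import logging
import sys


def build_import_logger(stream=None):
    """Return a logger that writes to the server console (stdout by default)."""
    logger = logging.getLogger("clinvar_import")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler(stream or sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


def run_import(import_func, logger) -> bool:
    """Run the import callable and log failure (with traceback) or success."""
    try:
        import_func()
    except Exception:
        logger.exception("ClinVar trait import failed")
        return False
    logger.info("ClinVar trait import finished")
    return True
```

Here import_func stands for whatever callable performs the actual import; a failed run leaves the full traceback in the log output.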

EBIvariation / trait-curation

Implement import of trait names from ClinVar #14