broadinstitute / seqr

web-based analysis tool for rare disease genomics
GNU Affero General Public License v3.0

Create loading pipeline for genome SVs #1608

Closed: hanars closed this issue 3 years ago

hanars commented 3 years ago

Decision to go with Hail after analysis (see below for the analysis discussion and links).

Acceptance Criteria:

  1. Merge PR with loading pipeline for SVs.
  2. This merge will be reviewed by MWilson (or someone on the pipeline team focused on Hail). Hana will also review.
  3. This will enable loading ALL genome SVs from a genome SV VCF file. (This will produce an index in Elasticsearch; follow-up work will be needed to enable seqr to use it productively.)
larrybabb commented 3 years ago

@ShifaSZ to set up a meeting with @hanars asap to figure out a game plan.

ShifaSZ commented 3 years ago

Last week:

  1. Loaded more fields. See more details.
  2. To understand the data better, tried loading and querying data with Hail. See more details. The query results for all SVTYPEs of variants with RD_CN in project R0332_cmg_estonia_wgs are: {'DEL': 24207, 'CPX': 36, 'CNV': 526, 'INS': 99, 'DUP': 24512}
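A per-SVTYPE count like the one above can be reproduced with a small stdlib-only parser. This is a toy illustration of the query (the actual exploration used Hail, not this code); the function name is hypothetical, and "has RD_CN" is interpreted here as "at least one sample carries a non-missing RD_CN value":

```python
from collections import Counter


def count_svtypes_with_rd_cn(vcf_lines):
    """Count records per SVTYPE, keeping only records where at least one
    sample has a non-missing RD_CN value (toy sketch, not the real pipeline)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # INFO column: semicolon-separated key=value pairs (flags have no value)
        info = dict(kv.split("=", 1) if "=" in kv else (kv, True)
                    for kv in fields[7].split(";"))
        fmt = fields[8].split(":")
        if "RD_CN" not in fmt:
            continue
        idx = fmt.index("RD_CN")
        for sample in fields[9:]:
            vals = sample.split(":")
            if idx < len(vals) and vals[idx] not in (".", ""):
                counts[info.get("SVTYPE", "NA")] += 1
                break
    return counts
```

In Hail the equivalent would be a `filter_entries`/`aggregate` over the MatrixTable rather than a line-by-line scan.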

This week:

  1. Implement formatting SVs
  2. Implement add_in_silico and export_to_elasticsearch
  3. Compare the performance with and without Hail?

hanars commented 3 years ago

Do not implement add_in_silico, as we have not received the file with the in silico scores and have no reason to believe we ever will.

ShifaSZ commented 3 years ago

Last week:

  1. Updated the data fields according to the updated requirement doc.
  2. Implemented formatting SVs. Click here to see examples of formatting results.
  3. Studied Hail for loading data. Findings: collecting the required data by filtering and annotating the Hail MatrixTable is very fast (a few minutes). But extracting JSON data from the MT by iterating over the rows is very slow (more than 10 times slower than loading with PyVCF). If we use Hail, we have to use Hail's built-in export-to-Elasticsearch function to load the annotated data into Elasticsearch.

This week:

  1. Implement export_to_elasticsearch
  2. Implement gene_symbol to gene_id mapping. (Possible ways to access the mapping table: load the mapping data locally; access the seqr Postgres DB through a new API; access an online mapping DB.)
  3. Add unit tests.
ShifaSZ commented 3 years ago

@hanars I added the mapping for the gene_id by using the Gencode file downloaded from here. The code has been pushed to GitHub. There are some gene symbols whose gene IDs can't be found in the Gencode file. For example, it can't find gene IDs for the symbols 'AL031847.2' and 'AL109811.4'. Do you know where I can find a better mapping file?
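Building the symbol-to-ID map from a Gencode GTF can be sketched roughly as below. This is an illustrative sketch, not the code on the branch; the function name is hypothetical, and stripping the version suffix from the Ensembl ID is one design choice among several:

```python
import gzip
import re


def load_gencode_gene_id_map(gtf_path):
    """Build a gene_symbol -> gene_id dict from a Gencode GTF file
    (sketch only; the actual branch code may differ)."""
    mapping = {}
    open_fn = gzip.open if gtf_path.endswith(".gz") else open
    with open_fn(gtf_path, "rt") as f:
        for line in f:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if fields[2] != "gene":
                continue
            # GTF attributes look like: gene_id "ENSG..."; gene_name "...";
            attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
            # drop the version suffix, e.g. ENSG00000223972.5 -> ENSG00000223972
            mapping[attrs["gene_name"]] = attrs["gene_id"].split(".")[0]
    return mapping
```

Symbols like 'AL031847.2' are clone-based names that come and go between Gencode releases, which is why matching the release Harrison used matters.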

hanars commented 3 years ago

ask Harrison what version of gencode they used so we can use the same one, which should probably fix that issue

ShifaSZ commented 3 years ago

@hanars I've just uploaded loading pipeline code that uses Hail and exports to Elasticsearch. It is much faster than the PyVCF version: it took only a little more than 1 minute to export the data for a project with 100+ samples. See here for more details.

hanars commented 3 years ago

can you please create a table or text document comparing PyVCF and Hail performance for READING IN the VCF and for PARSING the VCF. I do not care about elasticsearch export time for this analysis. You can either comment here or in the google document you have going, but committing the output of a script commented out on a branch is not a usable method of documentation
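The requested read-vs-parse comparison could be produced with a minimal timing harness like the one below. This is a hypothetical sketch (the harness name and the example labels are mine, not from the branch); each candidate step is passed in as a callable so PyVCF and Hail stages can be timed uniformly:

```python
import time


def time_step(label, fn, *args, **kwargs):
    """Run one pipeline step, print its wall-clock time, and return
    both the result and the elapsed seconds (hypothetical harness)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result, elapsed
```

Usage would look like `time_step("pyvcf read", read_with_pyvcf, vcf_path)` versus `time_step("hail read", read_with_hail, vcf_path)` (both reader functions are assumptions), with the elapsed values collected into the comparison table.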

ShifaSZ commented 3 years ago

I'm going to create a comparison table; the previous message was just preliminary information. Once the VCF file has been converted into a MatrixTable and saved to a file, Hail doesn't need to re-read the data for each project. All the data subsetting and annotating operations are done lazily, which means the real work is delayed until the data is exported to Elasticsearch. Therefore, the major part of the time consumed by the Hail pipeline is exporting the data, and converting the VCF to a MatrixTable can be done once for all projects. That's why the export time for Hail is important, and I wanted to share it with you before documenting it.
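The lazy-evaluation point can be illustrated with plain Python generators (a toy analogy only, not Hail code; all names here are made up): the subset and annotate stages build up a pipeline without touching the data, and everything actually runs when the export step consumes it.

```python
def subset_samples(records, wanted):
    # lazily keep only records belonging to wanted samples
    for rec in records:
        if rec["sample"] in wanted:
            yield rec


def annotate(records):
    # lazily add a derived field; still no work has happened
    for rec in records:
        yield dict(rec, label=f'{rec["svtype"]}@{rec["pos"]}')


def export(records):
    # only here is the whole pipeline actually executed
    return list(records)


records = [
    {"sample": "S1", "svtype": "DEL", "pos": 100},
    {"sample": "S2", "svtype": "DUP", "pos": 200},
]
pipeline = annotate(subset_samples(iter(records), {"S1"}))  # nothing runs yet
exported = export(pipeline)  # all the deferred work happens at export time
```

In the same way, timing only Hail's "read" or "filter" calls would understate the cost, since the measurable work lands in the export step.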

ShifaSZ commented 3 years ago

Last week:

  1. Added gene symbols to gene ids mapping
  2. Added exporting to Elasticsearch for both candidate solutions
  3. Compared the performance of the two solutions. Here is the analysis.

This week:

  1. Add unit tests
  2. Create a PR and resolve any issues that come up
ShifaSZ commented 3 years ago

Last week:

  1. We decided to go with Hail for importing and annotating the VCF data. The branch with the Hail code is genome_sv
  2. Started adding unit tests

This week:

  1. Continue adding unit tests
  2. Create a PR when unit tests are done
hanars commented 3 years ago

This week:

  1. Continue adding unit tests
  2. Create a PR when unit tests are done

@ShifaSZ as I mentioned in our meeting on Friday, it is important to get a PR up sooner rather than later. I would rather get one up faster without unit tests; you can then work on adding the unit tests while it is under review.

larrybabb commented 3 years ago

After the merge we will create a ticket to ask analysts to pick an initial project to pilot, and then ask them for feedback.

ShifaSZ commented 3 years ago

Last week:

  1. Created a PR and completed a round of review
  2. Added more unit tests

This week:

  1. Push all unit tests for review
  2. Work on a pilot project?
ShifaSZ commented 3 years ago

Last week:

  1. I spent most of my time fixing the issues found in the PR reviews.
  2. Updated unit tests

This week:

  1. Complete unit tests and close all review comments and merge the PR (link)
  2. Add genome structural variant support for searching and displaying (created new issue #1811)
hanars commented 3 years ago

point 2 for this week needs to be tracked via separate tickets as discussed in our meeting last week

ShifaSZ commented 3 years ago

Last week:

  1. Completed the unit tests; the PR is ready for final review. (PR link)

This week:

  1. Final review
  2. Confirm the formats of the data imported in ElasticSearch
ShifaSZ commented 3 years ago

@hanars You changed this task back to In Progress from Review/QA. Why is that? Is there anything I still need to do for this task? Thanks.