@ShifaSZ to set up a meeting with @hanars asap to figure out a game plan.
Last week:
1) Loaded more fields. See more details.
2) To understand the data better, tried loading and querying data with Hail. See more details.
The query results for all SVTYPEs of variants with RD_CN in project R0332_cmg_estonia_wgs are: {'DEL': 24207, 'CPX': 36, 'CNV': 526, 'INS': 99, 'DUP': 24512}
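For reference, a minimal Hail sketch of the kind of query that could produce counts like these; the VCF path is made up, and RD_CN is assumed to be a FORMAT field in this callset:

```python
import hail as hl

hl.init()

# Import the SV callset (path is hypothetical) into a MatrixTable.
mt = hl.import_vcf('R0332_cmg_estonia_wgs.vcf.gz',
                   reference_genome='GRCh38', force_bgz=True)

# Keep variants where at least one sample has an RD_CN value defined.
mt = mt.filter_rows(hl.agg.any(hl.is_defined(mt.RD_CN)))

# Count the remaining variants per SVTYPE.
print(mt.aggregate_rows(hl.agg.counter(mt.info.SVTYPE)))
```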
This week:
1) Implement formatting of SVs
2) Implement add_in_silico and export_to_elasticsearch
3) Compare the performance with and without Hail
Do not implement add_in_silico, as we have not received the file with the in silico scores and have no reason to believe we will ever receive that file.
Last week: Implemented the export_to_elasticsearch function to load the annotated data to Elasticsearch.
This week: Add the gene_symbol to gene_id mapping to export_to_elasticsearch. (Possible ways to access the mapping table: load the mapping data locally; access the seqr Postgres DB through a new API; access an online mapping DB.)

@hanars I added the gene_id mapping using the Gencode file downloaded from here. The code has been pushed to GitHub. There are some genes whose gene IDs can't be found in the Gencode file. For example, it can't find gene IDs for the gene symbols 'AL031847.2' and 'AL109811.4'. Do you know where I can find a better mapping file?
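For illustration, a rough sketch of how the symbol-to-ID mapping can be built from a Gencode GTF; the file name and release are assumptions, and only 'gene' feature lines are used:

```python
import gzip

# Hypothetical file name; the Gencode release is an assumption.
GENCODE_GTF = 'gencode.v29.annotation.gtf.gz'

def load_gencode_mapping(path):
    """Build a gene_symbol -> gene_id dict from Gencode 'gene' lines."""
    mapping = {}
    with gzip.open(path, 'rt') as f:
        for line in f:
            if line.startswith('#'):
                continue
            fields = line.rstrip('\n').split('\t')
            if fields[2] != 'gene':
                continue
            # Column 9 holds 'key "value";' pairs separated by semicolons.
            attrs = dict(
                item.strip().split(' ', 1)
                for item in fields[8].rstrip(';').split(';') if item.strip()
            )
            gene_id = attrs['gene_id'].strip('"').split('.')[0]  # drop version suffix
            gene_name = attrs['gene_name'].strip('"')
            mapping[gene_name] = gene_id
    return mapping

mapping = load_gencode_mapping(GENCODE_GTF)
print(mapping.get('AL031847.2'))  # None if the symbol is missing from this release
```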
Ask Harrison what version of Gencode they used so we can use the same one, which should probably fix that issue.
@hanars I've just uploaded loading pipeline code that uses Hail and exports to Elasticsearch. Its performance is much better than with PyVCF: it took only a little more than a minute to export the data for a project with 100+ samples. See here for more details.
Can you please create a table or text document comparing PyVCF and Hail performance for READING IN the VCF and for PARSING the VCF? I do not care about Elasticsearch export time for this analysis. You can either comment here or in the Google document you have going, but committing the output of a script, commented out on a branch, is not a usable method of documentation.
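A rough sketch of how such a read/parse timing comparison could be scripted; the VCF path is an assumption, `vcf` is the PyVCF package, and actual numbers will depend on the environment:

```python
import time
import vcf   # PyVCF
import hail as hl

VCF_PATH = 'svs.vcf.gz'  # hypothetical path

# PyVCF: reading and parsing happen while iterating over records.
start = time.time()
records = list(vcf.Reader(filename=VCF_PATH))
pyvcf_seconds = time.time() - start

# Hail: import_vcf is lazy, so force the read/parse with count_rows().
hl.init()
start = time.time()
mt = hl.import_vcf(VCF_PATH, reference_genome='GRCh38', force_bgz=True)
n_variants = mt.count_rows()
hail_seconds = time.time() - start

print(f'PyVCF: {len(records)} records in {pyvcf_seconds:.1f}s')
print(f'Hail : {n_variants} variants in {hail_seconds:.1f}s')
```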
I'm going to create a comparison table; the previous message is just preliminary information. Once the VCF file has been converted into a MatrixTable and saved to a file, Hail doesn't need to read in the data for each project. All of the data subsetting and annotating operations are lazy, which means the real computation is delayed until the data is exported to Elasticsearch. Therefore, the major part of the time consumption for the Hail pipeline is exporting the data, and converting the VCF to a MatrixTable can be done once for all projects. That's why the export time for Hail is important, and I wanted to share it with you before documenting it.
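A minimal sketch of that flow, assuming Hail's experimental export_elasticsearch and the elasticsearch-hadoop connector on the Spark classpath; the paths, sample IDs, and index name are made up:

```python
import hail as hl

hl.init()

# One-time conversion: import the VCF and persist it as a MatrixTable.
mt = hl.import_vcf('svs.vcf.gz', reference_genome='GRCh38', force_bgz=True)
mt.write('svs.mt', overwrite=True)

# Per-project work: reading the MatrixTable is cheap, and the subsetting
# below is lazy -- Hail only computes it when the export runs.
mt = hl.read_matrix_table('svs.mt')
project_samples = hl.literal({'SAMPLE-1', 'SAMPLE-2'})
mt = mt.filter_cols(project_samples.contains(mt.s))

# Flatten the row fields into a table of documents and push them to
# Elasticsearch. export_elasticsearch is experimental and its exact
# signature may differ across Hail versions.
rows = mt.rows().key_by().flatten()
hl.export_elasticsearch(rows, host='localhost', port=9200,
                        index='svs_project_index', index_type='_doc',
                        block_size=1000)
```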
This week:
- Continue adding unit tests
- Create a PR when unit tests are done
@ShifaSZ as I mentioned in our meeting on Friday, it is important to get a PR up sooner rather than later. I would rather get one up faster without unit tests, and then you can work on adding the unit tests while it is under review.
After the merge we will create a ticket to ask analysts to pick an initial project to pilot and then ask them for feedback...
point 2 for this week needs to be tracked via separate tickets as discussed in our meeting last week
@hanars You changed this task back to In Progress from Review/QA. What is that for? Is there anything I still need to do for this task? Thanks.
Decision to go with Hail after analysis (see below for the analysis discussion and links).
Acceptance Criteria: