PEDIA-Charite / PEDIA-workflow

PEDIA-workflow prioritizes the disease-causing gene by integrating facial analysis and exome sequencing data.
Other
7 stars 8 forks source link

The output from QC and phenomization #15

Open la60312 opened 6 years ago

la60312 commented 6 years ago

The output should contain the following files.

  1. phenomized JSON files in data/PEDIA/jsons/phenomized
  2. Each file which passed QC should has a VCF file in data/PEDIA/mutations. The VCF is converted from HGVS code.
  3. config.yml contains two lists. one is the SINGLE_SAMPLES. Another one is VCF_SAMPLES. The files which passed QC should be listed here. The sample with VCF goes to VCF_SAMPLES. The sample without VCF goes to SINGLE_SAMPLES.
SINGLE_SAMPLES:
 - 140807
 - 35305
 - 155130
 - 82000
VCF_SAMPLES:
 - 21147
 - 66964
  1. A report which listed the samples failed QC at each step. Then we are able to trace back to fix those samples.
xiamaz commented 6 years ago

I will start working on these issues starting next wednesday, is this soon enough?

la60312 commented 6 years ago

The deadline is on 9th of March. We will have the first test. The highest priority is to be able to run the whole workflow. The report is not so urgent.

xiamaz commented 6 years ago

Currently VCF generation has not been implemented in the new Pipeline.

@GSalma has been testing Jannovar and has found multiple issues, that might block us using it in the short term. @jakobhertzberg has already written a multiVCF generator in his pipeline.

Since Tzung has been able to use the multiVCF generated by Jakob, it currently should be the best, if we ported this to the new pipeline. @jakobhertzberg, could you add the VCF generation to the new QC? I think just adding the function in the lib and using it in the hgvs validation in lib/model/hgvs.py. Then we could filter the cases on the number of valid hgvs. I would be great if we could simply add the vcf as an attribute to the case object. Then we could just save it later when doing the case to oldJson conversion.

la60312 commented 6 years ago

The multiVCF is done by me. I used the HGVS library from biocommons, because you used this library to check HGVS code. There are some issues with HGVS library, too. For the demo last week, I just excluded some cases manually. Moreover, it doesn't provide the function to convert HGVS to VCF. I used the genomic position and the dup, del or sub information to convert HGVS to VCF format. I don't know if I convert it correctly with indel.

jakob-he commented 6 years ago

Which problems did you have with the HGVS library? I could filter the HGVS codes that can not be mapped to a genomic variant with the HGVS library and then construct a VCF dataframe for each case object from the variant information.

jakob-he commented 6 years ago

382 of the current HGVS codes that passed the new QC can be mapped to a genomic variant with the hgvs library and then used in the vcf file generation. However, the construction of a VCF file based on the genomic variant information is really basic at this point. So some of those vcf files may be erroneous. I think it might be better to use Jannovar to reduce the possibility of an error. Especially if we manage to convert the same amount of HGVS codes.

la60312 commented 6 years ago

Please refer the attached python file. I used this script to generate multiVCF last Tuesday. I didn't have time to track error because Peter needed the JSON files for presentation. I just remove the sample with problem manually.

There are some samples with the type 'identity'. It means the ref and alt are the same. The following json files are the files which can not be converted to VCF format. In theory, you should get at least 414 files passed QC.

                   if var_type == "identity":
                        print(filename)

103264.json 107857.json 131126.json 146038.json 31531.json 44942.json 45143.json 62286.json 81012.json 107819.json 114197.json 131420.json 155433.json 34235.json 45134.json 45148.json 65229.json 81048.json 107824.json 114666.json 140813.json 155735.json 36436.json 45135.json 45251.json 65747.json 81057.json 107832.json 124162.json 141354.json 204367.json 37083.json 45136.json 45260.json 65749.json 81723.json 107833.json 127502.json 143331.json 204368.json 37087.json 45138.json 51545.json 73413.json 81729.json 107839.json 128546.json 143332.json 204370.json 39261.json 45141.json 51716.json 80730.json 84222.json 107845.json 130971.json 144471.json 31445.json 44840.json 45142.json 62099.json 81011.json 87685.json

get_multiVCF.txt

la60312 commented 6 years ago

The 414 files I mentioned are from the new_QC last week. Max provided me with 489 files which were passed new_qc. There were only 414 files which can be converted to VCF format. Therefore, I would think you should get at least this number of files.

As you can see in the get_multiVCF.py. I converted HGVS to VCF in a very simple and ugly way. Hence, I would suggest we switch to Jannovar.

jakob-he commented 6 years ago

I just committed the change to the case class: ef8ae6122a8f6d2e915432a8ce8dfe7432315798 if you want to have a look. The function is very similar to yours. Jannovar is probably the better choice. I think Salma is currently working on it. @xiamaz How did you get 489 cases through QC? I currently only get 376.

GSalma commented 6 years ago

Jannovar has a bug with fetching the reference sequence which I discussed with Max (bih) on the Jannovar HGVS to VCF conversion issue so at the moment it seems that we can only convert a small fraction using it.