Minimize the metadata of VCF at the beginning of the workflow

jylee-bcm commented 2 months ago
This PR strips the VCF header down to the minimum at the start of the workflow, so that the workflow no longer fails on VCF files with malformed headers. Most of that header information is not used by our workflow at all.
Specifically, our workflow relies heavily on bcftools and tabix, and most bcftools and tabix commands fail when the header is malformed.
I would like to ask for your review:
@hyunhwan-bcm, please check whether the new logic is placed properly, or suggest a better place for it.
@arine, please let me know if this code needs more test cases. So far I have tested it with the demo data and the most recently reported data.
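For readers following the thread, the idea of minimizing the header can be sketched roughly as below. This is only an illustration, not the code in this PR: the function name `minimize_vcf_header` and the `keep_prefixes` parameter are hypothetical, and it assumes that keeping the `##fileformat` line plus the `#CHROM` column line is sufficient. Depending on the downstream bcftools commands, `##contig` or `##FORMAT` lines may need to be kept as well.

```python
from typing import Iterable, Iterator, Tuple


def minimize_vcf_header(
    lines: Iterable[str],
    keep_prefixes: Tuple[str, ...] = ("##fileformat",),
) -> Iterator[str]:
    """Yield VCF lines, dropping '##' meta lines unless they start with a kept prefix.

    The '#CHROM' column line and all variant records pass through unchanged.
    """
    for line in lines:
        if line.startswith("##") and not line.startswith(keep_prefixes):
            continue  # drop unused (and possibly malformed) metadata
        yield line


# Hypothetical input with one malformed INFO line (missing closing '>').
example = [
    "##fileformat=VCFv4.2\n",
    "##INFO=<ID=DP,Number=1,Type=Integer,Description=broken\n",
    "##source=someCaller\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t12345\t.\tA\tG\t.\t.\t.\n",
]
minimized = list(minimize_vcf_header(example))
```

Passing extra prefixes, for example `keep_prefixes=("##fileformat", "##contig")`, would retain more of the header if a later step turns out to need it.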
arine commented 2 months ago
Test000 (s3://aim-test-data/test000) failed with:
Caused by:                                                                                                       
  Process `ANNOTATE_BY_MODULES (chr1)` terminated with an error exit status (1)                                  

Command executed:                                                                                                

  feature.py \                                                                                                   
      -patientHPOsimiOMIM omim_sim.tsv \                                                                         
      -patientHPOsimiHGMD hgmd_sim.tsv \                                                                         
      -varFile chr1.vcf-vep.txt \                                                                                
      -inFileType vepAnnotTab \                                                                                  
      -patientFileType one \                                                                                     
      -genomeRef hg19 \                                                                                          
      -diseaseInh AD \
      -modules curate,conserve
      mv scores.csv chr1.vcf-vep_scores.csv                                                                      

Command exit status:                                                                                             
  1                                                                                                              

Command output:
  input file: chr1.vcf-vep.txt                                                                                   
  type of input file: vepAnnotTab                                                                                
  modules: curate,conserve
  modules list: ['curate', 'conserve']                                                                           
  patientHPOsimi-OMIM dimension: (6393, 7)                                                                       
  patientHPOsimi-HGMD dimension: (346526, 6)                                                                     
  reading DGV flat file
  finsihed reading DGV                                                                                           
  reading Decipher flat file                                                                                     
  finsihed reading DECIPHER
  input annoatated varFile: chr1.vcf-vep.txt                                                                     
  shape: (0, 735)
  found GERP++RS
  found GERP++NR
  pipeline time: 5.037638425827026                                                                               
  log file name: log.txt
  input read time: 0.09162592887878418                                                                           
  input num rows: 0
  m: ['curate', 'conserve']
  Score re-calculation:

Command error:
  input file: chr1.vcf-vep.txt                                                                                   
  type of input file: vepAnnotTab                                                                                
  modules: curate,conserve
  modules list: ['curate', 'conserve']                                                                           
  patientHPOsimi-OMIM dimension: (6393, 7)                                                                       
  patientHPOsimi-HGMD dimension: (346526, 6)                                                                     
  reading DGV flat file
  finsihed reading DGV
  reading Decipher flat file                                                                                     
  finsihed reading DECIPHER
  input annoatated varFile: chr1.vcf-vep.txt                                                                     
  shape: (0, 735)
  found GERP++RS
  found GERP++NR
  pipeline time: 5.037638425827026              
  log file name: log.txt                                                                                        
  input read time: 0.09162592887878418                                                                          
  input num rows: 0                                                                                             
  m: ['curate', 'conserve']                                                                                     
  Score re-calculation:                                                                                         
  /home/sunyoung/tmp/tmpf9kb21es/bin/feature.py:194: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
    dgvDf = pd.read_csv(fileName, sep=",")                                                                      
  /home/sunyoung/tmp/tmpf9kb21es/bin/feature.py:260: FutureWarning: The error_bad_lines argument has been deprecated and will be removed in a future version. Use on_bad_lines in the future.

    varDf = pd.read_csv(                                                                                        
  Traceback (most recent call last):                                                                            
    File "/home/sunyoung/tmp/tmpf9kb21es/bin/feature.py", line 423, in <module>
      main()
    File "/home/sunyoung/tmp/tmpf9kb21es/bin/feature.py", line 411, in main
      score = load_raw_matrix(annotateInfoDf)
    File "/home/sunyoung/tmp/tmpf9kb21es/bin/annotation/marrvel_score_recalc.py", line 90, in load_raw_matrix
      return score.loc[:, raw_features].copy()
    File "/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py", line 961, in __getitem__
      return self._getitem_tuple(key)
    File "/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py", line 1149, in _getitem_tuple
      return self._getitem_tuple_same_dim(tup)
    File "/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py", line 827, in _getitem_tuple_same_dim
      retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
    File "/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py", line 1191, in _getitem_axis
      return self._getitem_iterable(key, axis=axis)
    File "/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py", line 1132, in _getitem_iterable
      keyarr, indexer = self._get_listlike_indexer(key, axis)                                                   
    File "/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py", line 1327, in _get_listlike_indexer  
      keyarr, indexer = ax._get_indexer_strict(key, axis_name)                                                  
    File "/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py", line 5782, in _get_indexer_strict
      self._raise_if_missing(keyarr, indexer, axis_name)                                                        
    File "/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py", line 5845, in _raise_if_missing  
      raise KeyError(f"{not_found} not in index")                                                               
  KeyError: "['chrom', 'pos', 'varId', 'varId_dash', 'zyg', 'geneSymbol', 'geneEnsId', 'gnomadAF', 'gnomadAFg', 'symptomName', 'omimSymptomSimScore', 'omimSymMatchFlag', 'hgmdSymptomScore', 'hgmdSymptomSimScore', 'hgmdSymMatchFlag', 'clinVarSymMatchFlag', 'gnomadGeneZscore', 'gnomadGenePLI', 'gnomadGeneOELof', 'gnomadGeneOELofUpper', 'omimGeneFound', 'omimVarFound', 'hgmdGeneFound', 'hgmdVarFound', 'clinVarVarFound', 'clinVarGeneFound', 'clinvarTotalNumVars', 'clinvarNumP', 'clinvarNumLP', 'clinvarNumLB', 'clinvarNumB', 'clinvarSignDesc', 'clinvarCondition', 'dgvVarFound', 'decipherVarFound', 'curationScoreHGMD', 'curationScoreOMIM', 'curationScoreClinVar', 'conservationScoreDGV', 'conservationScoreGnomad', 'conservationScoreOELof', 'hom', 'hgmd_rs', 'clin_dict', 'clin_PLP', 'clin_PLP_perc', 'spliceAImax', 'clin_code', 'hgmd_id', 'rsId', 'phenoList', 'phenoInhList'] not in index"

Work dir:                                                                                                    
  /home/sunyoung/tmp/tmpf9kb21es/work/40/de07ad4c4753d655e16b6a9e874c98                                         

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
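
The log above reports `shape: (0, 735)` and `input num rows: 0` shortly before the `KeyError`. One plausible reading is that the annotated table for chr1 contained no variant rows when `feature.py` ran. A minimal sketch of an early check that would surface this more clearly is shown below. The helper names `count_data_rows` and `require_variants` are hypothetical and not part of the repository, and the sketch assumes that header lines in the VEP tab file start with `#`.

```python
from typing import Iterable


def count_data_rows(lines: Iterable[str]) -> int:
    """Count non-empty lines that are not '#'-prefixed header lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def require_variants(lines: Iterable[str], label: str) -> int:
    """Return the number of variant rows, or raise a descriptive error if there are none."""
    n_rows = count_data_rows(lines)
    if n_rows == 0:
        raise ValueError(f"{label}: no variant rows found in annotated input")
    return n_rows


# Hypothetical contents of an annotated file that has a header but no variants.
header_only = ["## VEP output\n", "#Uploaded_variation\tLocation\tAllele\n"]
try:
    require_variants(header_only, "chr1.vcf-vep.txt")
except ValueError as exc:
    message = str(exc)
```

Failing at this point would name the empty input file directly instead of ending in a long `KeyError` from `score.loc[:, raw_features]`.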
jylee-bcm commented 2 months ago
@arine Thanks for testing! I found the same error, so I have fixed it and pushed the change.
LiuzLab / AI_MARRVEL

Minimize the metadata of VCF at the beginning of the workflow #94