genome-nexus / genome-nexus-annotation-pipeline

Library and tool for annotating MAF files using Genome Nexus Webserver API
MIT License

Many output errors and warnings when running the minimal example #193

Open inodb opened 2 years ago

inodb commented 2 years ago

Seems like things get annotated correctly, but there are tons of errors about reading fields:

Something went wrong reading field Entrez_Gene_Id
Something went wrong reading field dbSNP_RS
Something went wrong reading field Match_Norm_Seq_Allele1
Something went wrong reading field Validation_Method
Something went wrong reading field Match_Norm_Seq_Allele2
Something went wrong reading field n_ref_count
Something went wrong reading field t_alt_count
Something went wrong reading field BAM_File
Something went wrong reading field Variant_Classification
Something went wrong reading field dbSNP_Val_Status
Something went wrong reading field Mutation_Status
Something went wrong reading field Matched_Norm_Sample_Barcode
Something went wrong reading field Validation_Status
Something went wrong reading field Variant_Type
Something went wrong reading field Strand
Something went wrong reading field Hugo_Symbol
Something went wrong reading field Sequencer
Something went wrong reading field n_alt_count
Something went wrong reading field Center
Something went wrong reading field Match_Norm_Validation_Allele2
Something went wrong reading field Tumor_Sample_Barcode
Something went wrong reading field Verification_Status
Something went wrong reading field t_ref_count
Something went wrong reading field Tumor_Seq_Allele2
Something went wrong reading field Match_Norm_Validation_Allele1
Something went wrong reading field Score
Something went wrong reading field Sequencing_Phase
Something went wrong reading field Tumor_Validation_Allele2
Something went wrong reading field Tumor_Validation_Allele1
Something went wrong reading field NCBI_Build
Something went wrong reading field Sequence_Source
Something went wrong reading field Entrez_Gene_Id
Something went wrong reading field dbSNP_RS
Something went wrong reading field Match_Norm_Seq_Allele1
Something went wrong reading field Validation_Method
Something went wrong reading field Match_Norm_Seq_Allele2
Something went wrong reading field n_ref_count
Something went wrong reading field t_alt_count
Something went wrong reading field BAM_File
Something went wrong reading field Variant_Classification
Something went wrong reading field dbSNP_Val_Status
Something went wrong reading field Mutation_Status
Something went wrong reading field Matched_Norm_Sample_Barcode
Something went wrong reading field Validation_Status
Something went wrong reading field Variant_Type
Something went wrong reading field Strand
Something went wrong reading field Hugo_Symbol
Something went wrong reading field Sequencer
Something went wrong reading field n_alt_count
Something went wrong reading field Center
Something went wrong reading field Match_Norm_Validation_Allele2
Something went wrong reading field Tumor_Sample_Barcode
Something went wrong reading field Verification_Status
Something went wrong reading field t_ref_count
Something went wrong reading field Tumor_Seq_Allele2
Something went wrong reading field Match_Norm_Validation_Allele1
Something went wrong reading field Score
Something went wrong reading field Sequencing_Phase
Something went wrong reading field Tumor_Validation_Allele2
Something went wrong reading field Tumor_Validation_Allele1
Something went wrong reading field NCBI_Build
Something went wrong reading field Sequence_Source
Something went wrong reading field Entrez_Gene_Id
Something went wrong reading field dbSNP_RS
Something went wrong reading field Match_Norm_Seq_Allele1
Something went wrong reading field Validation_Method
Something went wrong reading field Match_Norm_Seq_Allele2
Something went wrong reading field n_ref_count
Something went wrong reading field t_alt_count
Something went wrong reading field BAM_File
Something went wrong reading field Variant_Classification
Something went wrong reading field dbSNP_Val_Status
Something went wrong reading field Mutation_Status
Something went wrong reading field Matched_Norm_Sample_Barcode
Something went wrong reading field Validation_Status
Something went wrong reading field Variant_Type
Something went wrong reading field Strand
Something went wrong reading field Hugo_Symbol
Something went wrong reading field Sequencer
Something went wrong reading field n_alt_count
Something went wrong reading field Center
Something went wrong reading field Match_Norm_Validation_Allele2
Something went wrong reading field Tumor_Sample_Barcode
Something went wrong reading field Verification_Status
Something went wrong reading field t_ref_count
Something went wrong reading field Tumor_Seq_Allele2
Something went wrong reading field Match_Norm_Validation_Allele1
Something went wrong reading field Score
Something went wrong reading field Sequencing_Phase
Something went wrong reading field Tumor_Validation_Allele2
Something went wrong reading field Tumor_Validation_Allele1
Something went wrong reading field NCBI_Build
Something went wrong reading field Sequence_Source

Annotation Summary:
    Records with ambiguous SNP and INDEL allele changes:  0
    All variants annotated successfully without failures!

ozguzMete commented 2 years ago

These are basically columns missing from the example MAF file. Interestingly, we get the column names from the file and then merge them with the predefined list of column names inside MutationRecord. MutationRecord has 36 headers, while the MAF format has 126 columns in total... These 36 columns look more "required" than the rest, but judging by your comment, not that "required".

Do we really need to use the predefined list of column names inside MutationRecord? If not, we can solve the problem by removing this merge operation.

The code gets an IllegalArgumentException because a column with the given name is not defined. We can simply suppress this exception since the annotation is "correct" -- or -- instead of logging an error we can log a warning.

In both cases, we should print a clearer error/warning message when we get the IllegalArgumentException at that point, and it should be this: No such column name: XXXXXXXX -- or -- Missing column: XXXXXXX
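
To make the "log a warning instead of an error" option concrete, here is a minimal sketch. It is not the pipeline's actual code: the class MissingColumnSketch, the methods readStrict and readOrEmpty, and the in-memory warnings list are all hypothetical stand-ins, and the row is modelled as a plain Map rather than whatever reader MutationRecord is populated from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the pipeline itself.
class MissingColumnSketch {
    // Collected warnings; a real pipeline would send these to its logger instead.
    static final List<String> warnings = new ArrayList<>();

    // Stand-in for the strict lookup: throws when the column is not defined.
    static String readStrict(Map<String, String> row, String column) {
        if (!row.containsKey(column)) {
            throw new IllegalArgumentException("No such column name: " + column);
        }
        return row.get(column);
    }

    // Lenient lookup: names the missing column in a warning and falls back to an empty value.
    static String readOrEmpty(Map<String, String> row, String column) {
        try {
            return readStrict(row, column);
        } catch (IllegalArgumentException e) {
            warnings.add("Missing column: " + column);
            return "";
        }
    }

    public static void main(String[] args) {
        // Assumed input: a row that only has some of the minimal columns.
        Map<String, String> row = Map.of("chrom", "7", "start_pos", "140453136");
        System.out.println(readOrEmpty(row, "chrom"));
        System.out.println(readOrEmpty(row, "Hugo_Symbol").isEmpty());
        System.out.println(warnings);
    }
}
```

With this shape the message names the column, and the caller decides whether an empty value is acceptable for that field.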

inodb commented 2 years ago

Let's change the behavior:

  1. Check that the five minimum columns exist (chrom, start_pos, end_pos, ref, Tumor_Seq_Allele1), the same as in data/minimal_example.in.txt.
  2. Other columns can be safely ignored (we do want to keep them in the output file).
  3. For the output: right now a lot of extra empty columns are written (see minimal_example.out.uniprot.txt). Maybe we can have an optional argument to indicate "only output new columns", e.g. --output-format minimal or --output-format mskcc. Maybe in the future we can support a format file so you can add custom formats (https://github.com/genome-nexus/genome-nexus-annotation-pipeline/issues/194).
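
The three points above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class MinimalColumnsSketch and its methods are hypothetical, the extra input column my_note is made up, and the annotation column names passed in main are placeholders rather than the pipeline's real output columns.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: names here are illustrative, not the pipeline's API.
class MinimalColumnsSketch {
    // The five columns named in point 1.
    static final List<String> REQUIRED =
            List.of("chrom", "start_pos", "end_pos", "ref", "Tumor_Seq_Allele1");

    // Returns the required columns that are absent from the input header, in REQUIRED order.
    static List<String> missingRequired(List<String> header) {
        List<String> missing = new ArrayList<>(REQUIRED);
        missing.removeAll(header);
        return missing;
    }

    // Builds the output header: every input column in its original order (point 2),
    // followed only by annotation columns the input did not already have (point 3).
    static List<String> outputHeader(List<String> header, List<String> annotationColumns) {
        Set<String> columns = new LinkedHashSet<>(header);
        columns.addAll(annotationColumns);
        return new ArrayList<>(columns);
    }

    public static void main(String[] args) {
        // Assumed input header: the five minimal columns plus one made-up extra column.
        List<String> header = List.of(
                "chrom", "start_pos", "end_pos", "ref", "Tumor_Seq_Allele1", "my_note");
        List<String> missing = missingRequired(header);
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException("Missing required columns: " + missing);
        }
        // Placeholder annotation columns, chosen only for illustration.
        System.out.println(outputHeader(header, List.of("Hugo_Symbol", "HGVSp_Short")));
    }
}
```

Failing fast on the five required columns keeps the error to a single message, while every other column passes through untouched and no empty predefined columns are added.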