aradenbaugh / radia

RADIA: RNA and DNA Integrated Analysis for Somatic Mutation Detection
GNU Affero General Public License v3.0
29 stars 11 forks source link

Allele error #9

Closed msubirana closed 4 years ago

msubirana commented 4 years ago

Hi I run RADIA in my samples and I got this alleles in the calling:

A B C G K M R S T W Y

Is it normal? How can I deal with them? During the postporecessing of the vcf I am getting a lot of errors due to them.

aradenbaugh commented 4 years ago

Hi, I'm not sure if I understand exactly what you mean. These look like they could be the IUPAC codes that are used to encode the consensus sequence of a population of aligned sequences, for example in phylogenetic analysis or in some BLAST searches. I have also seen IUPAC codes used in INDELs. Are you saying that you are seeing these nucleotide codes as the alternative allele in the VCF? Can you send me a few things (please don't share any private data)?

An example of:

  1. The VCF line of one of these calls
  2. The actual error message that you receive in post processing
  3. The samtools mpileup of the region where the call is made. For example, use the following command on the coordinate of the VCF call that you provided in (1):

samtools mpileup -f /path/to/fasta/fasta.fa -Q 0 -q 0 -r chr1:start-stop /path/to/bams/myBam.bam

Please include a range of bases up- and down-stream from the coordinate to account for nearby INDELs. (Note: you may or may not need the chr prefix depending on your data and corresponding reference file.

It would also be great if you could provide a bit of background on the sequences that you are using. Are they suppose to be DNA or RNA sequences with only the typical {A,C,G,T,N} nucleotides?

msubirana commented 4 years ago

Hi,

Yes, the problem is that I am getting IUPAC codes in the ALT allele. I include all the information that you asked for, below:

Using awk I am getting the following output:

bcftools view -H out_radia.vcf | awk '{print $4}' | sort | uniq -c

12856551 A 2 B 4195344 C 4228757 G 7 K 8 M 22 R 3 S 12852425 T 11 W 31 Y

1. Examples vcf lines where there are these errors:

chr10 124122080 . B C,T 0 PASS AC=35,44;AF=0.44,0.56;AN=3;BQ=39;DEL=0;DP=79;FA=1;INS=0;MC=B>C,B>T;MQ=60;MQ0=0;MT=GERM,GERM;NS=2;SB=0.48;SS=1;START=0;STOP=2;VT=SNP GT:DP:AD:AF:INS:DEL:DP4:START:STOP:MQ0:MMQ:MQA:BQ:SB:MMP 1/2:34:0,15,19:0,0.44,0.56:0:0:0,0,6,9,10,9:0:1:0,0,0:0,60,60:0,60,60:0,40,39:0,0.4,0.53:.,.,. 1/2:45:0,20,25:0,0.44,0.56:0:0:0,0,9,11,13,12:0:1:0,0,0:0,60,60:0,60,60:0,40,39:0,0.45,0.52:.,.,.

chr3 16902883 . B T,G 0 PASS AC=78,2;AF=0.97,0.03;AN=3;BQ=16;DEL=0;DP=80;FA=1;INS=0;MC=B>T,B>G,T>G;MQ=60;MQ0=0;MT=GERM,SOM,SOM;NS=2;SB=0.44;SS=1;START=0;STOP=0;VT=SNP GT:DP:AD:AF:INS:DEL:DP4:START:STOP:MQ0:MMQ:MQA:BQ:SB:MMP 1/1:29:0,29,0:0,1,0:0:0:0,0,7,22,0,0:0:0:0,0,0:0,60,0:0,60,0:0,16,0:0,0.24,0:.,.,. 1/1:51:0,49,2:0,0.96,0.04:0:0:0,0,26,23,2,0:0:0:0,0,0:0,60,60:0,60,60:0,15,17:0,0.53,1:.,.,.

chr1 248752514 . M C 0 PASS AC=58;AF=1;AN=2;BQ=38;DEL=0;DP=58;FA=1;INS=0;MC=M>C;MQ=60;MQ0=1;MT=GERM;NS=2;SB=0.5;SS=1;START=0;STOP=0;VT=SNP GT:DP:AD:AF:INS:DEL:DP4:START:STOP:MQ0:MMQ:MQA:BQ:SB:MMP 1/1:26:0,26:0,1:0:0:0,0,10,16:0:0:0,1:0,60:0,60:0,40:0,0.38:.,. 1/1:32:0,32:0,1:0:0:0,0,19,13:0:0:0,0:0,60:0,60:0,37:0,0.59:.,.

chr10 39245283 . M A 0 PASS AC=91;AF=1;AN=2;BQ=38;DEL=0;DP=91;FA=1;INS=0;MC=M>A;MQ=60;MQ0=0;MT=GERM;NS=2;SB=0.49;SS=1;START=2;STOP=0;VT=SNP GT:DP:AD:AF:INS:DEL:DP4:START:STOP:MQ0:MMQ:MQA:BQ:SB:MMP 1/1:41:0,41:0,1:0:0:0,0,23,18:1:0:0,0:0,60:0,60:0,37:0,0.56:.,. 1/1:50:0,50:0,1:0:0:0,0,22,28:1:0:0,0:0,60:0,60:0,39:0,0.44:.,.

chr10 39285662 . M C 0 PASS AC=84;AF=1;AN=2;BQ=38;DEL=0;DP=84;FA=1;INS=0;MC=M>C;MQ=60;MQ0=0;MT=GERM;NS=2;SB=0.55;SS=1;START=2;STOP=0;VT=SNP GT:DP:AD:AF:INS:DEL:DP4:START:STOP:MQ0:MMQ:MQA:BQ:SB:MMP 1/1:40:0,40:0,1:0:0:0,0,22,18:0:0:0,0:0,60:0,60:0,37:0,0.55:.,. 1/1:44:0,44:0,1:0:0:0,0,24,20:2:0:0,0:0,60:0,60:0,39:0,0.55:.,.

2. The actual error is using gatk4 SelectVaraints:

htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 2710417: unparsable vcf record with allele M, for input source: /imppc/labs/lplab/share/marc/insulinomas/processed/hg38/vcf/calling_ergWgsTools/radia/NET-10_out_radia.vcf at htsjdk.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:887) at htsjdk.variant.vcf.AbstractVCFCodec.checkAllele(AbstractVCFCodec.java:678) at htsjdk.variant.vcf.AbstractVCFCodec.parseAlleles(AbstractVCFCodec.java:640) at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:443) at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:384) at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:328) at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:48) at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70) at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37) at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:373) at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:354) at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:315) at java.util.Iterator.forEachRemaining(Iterator.java:116) at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at org.broadinstitute.hellbender.engine.VariantWalker.traverse(VariantWalker.java:102) at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205) at org.broadinstitute.hellbender.Main.main(Main.java:291)

3. samtools mpileup

I attach the result. I did the calling 1k bp arrount the first error for avoiding generate enormous output files.

samtools mpileup -f $ISHARE/marc/refgen/hg38/hg38.fa -Q 0 -q 0 -r chr1:248751514-248754514 $ISHARE/marc/insulinomas/processed/hg38/bam/bwa/to_do/NET-10_TI.bam > samtools_mpileup_TI_sample_chr1_248751514_248754514.vcf

It would also be great if you could provide a bit of background on the sequences that you are using. Are they suppose to be DNA or RNA sequences with only the typical {A,C,G,T,N} nucleotides?

The sequences are DNA with the typical A,C,G,T,N nucleotides. They are paired-end tumor control wgs cancer samples coming from pancreas.

I you neeed something else let me know! Thanks!

On Mon, Feb 17, 2020 at 6:27 AM aradenbaugh notifications@github.com wrote:

Hi, I'm not sure if I understand exactly what you mean. These look like they could be the IUPAC codes that are used to encode the consensus sequence of a population of aligned sequences, for example in phylogenetic analysis or in some BLAST searches. I have also seen IUPAC codes used in INDELs. Are you saying that you are seeing these nucleotide codes as the alternative allele in the VCF? Can you send me a few things (please don't share any private data)?

An example of:

  1. The VCF line of one of these calls
  2. The actual error message that you receive in post processing
  3. The samtools mpileup of the region where the call is made. For example, use the following command on the coordinate of the VCF call that you provided in (1):

samtools mpileup -f /path/to/fasta/fasta.fa -Q 0 -q 0 -r chr1:start-stop /path/to/bams/myBam.bam

Please include a range of bases up- and down-stream from the coordinate to account for nearby INDELs. (Note: you may or may not need the chr prefix depending on your data and corresponding reference file.

It would also be great if you could provide a bit of background on the sequences that you are using. Are they suppose to be DNA or RNA sequences with only the typical {A,C,G,T,N} nucleotides?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/aradenbaugh/radia/issues/9?email_source=notifications&email_token=ALO45HYUI7JSTLCAVUYEIVLRDIN27A5CNFSM4KUVI7IKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEL5DQKY#issuecomment-586823723, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALO45H63ZYEOMPYLYX6IF43RDIN27ANCNFSM4KUVI7IA .

aradenbaugh commented 4 years ago

Thanks for all of the information. In the examples you provided, it looks like the IUPAC codes are only in the reference (REF) column, not in the alternative allele (ALT) column. I don't see the samtools mpileup attachment to confirm. There is an NCBI hg38 reference fasta file that uses IUPAC ambiguous codes for bases that could not be identified as {A,G,C,T}. It seems like you're using an hg38 reference file with the IUPAC ambiguous codes. The UCSC hg38 genome assembly process converts all the IUPAC ambiguity characters to 'N'. You may want to use a different hg38 reference file that doesn't include the IUPAC codes during the mapping process and when running RADIA. Or, convert all non {A,G,C,T} reference alleles to 'N' in the VCF. Or, there seem to be some other scripts out there that try to map the IUPAC codes to multiple reference alleles to allow for variation.

Here are some possibly interesting links about this issue:

http://genome.cse.ucsc.edu/cgi-bin/hgTrackUi?hgsid=789969291_9KAS83RUaPEAP1TidGZdaouKZ5ef&c=chr1&g=gold https://www.biostars.org/p/114614/#116514 https://www.biostars.org/p/126118/ https://www.biostars.org/p/65885/

msubirana commented 4 years ago

Perfect, I just filtered the ambiguous positions and now is working.

Now, filtering the output of calling I got this error:

/imppc/labs/lplab/share/marc/repos/miniconda/bin/python2.7 /imppc/labs/lplab/share/marc/repos/radia/scripts/filterRadia.py NET-10 chr22 /imppc/labs/lplab/share/marc/insulinomas/processed/hg38/vcf/calling_ergWgsTools2/radia/NET-10_out_radia_chr22.vcf /imppc/labs/lplab/share/marc/repos/radia/scripts /imppc/labs/lplab/share/marc/insulinomas/processed/hg38/vcf/calling_ergWgsTools2/radia/filtered --dnaOnly --noSnpEff --noRadar --noDarned --ignoreScriptsDir -b /imppc/labs/lplab/share/marc/repos/radia/data/hg38/blacklists/1000Genomes/phase3 -d /imppc/labs/lplab/share/marc/repos/radia/data/hg38/snp151 -r /imppc/labs/lplab/share/marc/repos/radia/data/hg38/retroGenes -t /imppc/labs/lplab/share/marc/repos/radia/data/hg38/gencode/basic -p /imppc/labs/lplab/share/marc/repos/radia/data/hg38/pseudoGenes -c /imppc/labs/lplab/share/marc/repos/radia/data/hg38/cosmic

('Cannot read the file, check to see if the path exists: ', '/imppc/labs/lplab/share/marc/repos/radia/data/hg38/blacklists/1000Genomes/phase3/chrchr22.bed')

Any suggestion?

On Wed, Feb 19, 2020 at 7:17 AM aradenbaugh notifications@github.com wrote:

Thanks for all of the information. In the examples you provided, it looks like the IUPAC codes are only in the reference (REF) column, not in the alternative allele (ALT) column. I don't see the samtools mpileup attachment to confirm. There is an NCBI hg38 reference fasta file that uses IUPAC ambiguous codes for bases that could not be identified as {A,G,C,T}. It seems like you're using an hg38 reference file with the IUPAC ambiguous codes. The UCSC hg38 genome assembly process converts all the IUPAC ambiguity characters to 'N'. You may want to use a different hg38 reference file that doesn't include the IUPAC codes during the mapping process and when running RADIA. Or, convert all non {A,G,C,T} reference alleles to 'N' in the VCF. Or, there seem to be some other scripts out there that try to map the IUPAC codes to multiple reference alleles to allow for variation.

Here are some possibly interesting links about this issue:

http://genome.cse.ucsc.edu/cgi-bin/hgTrackUi?hgsid=789969291_9KAS83RUaPEAP1TidGZdaouKZ5ef&c=chr1&g=gold https://www.biostars.org/p/114614/#116514 https://www.biostars.org/p/126118/ https://www.biostars.org/p/65885/

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/aradenbaugh/radia/issues/9?email_source=notifications&email_token=ALO45HYMSSXM2KLLFUFT3DDRDTFHVA5CNFSM4KUVI7IKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEMGQEHQ#issuecomment-588055070, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALO45HZT66J4FKKGQNGLEYDRDTFHVANCNFSM4KUVI7IA .

aradenbaugh commented 4 years ago

Great, glad to hear you were able to filter the ambiguous positions.

The next error should be pretty easy to fix. Instead of specifying 'chr22' for the chromosome, just use '22' (i.e. drop the 'chr'). In the future, I will fix the code to check if the user has already specified the 'chr' or not. Hope that works for you! Let me know if you run into any other issues.