cauyrd / transIndel

Indel caller for DNA-seq or RNA-seq
GNU General Public License v3.0
14 stars 7 forks source link

Duplicate sites and missing heterozygotes #2

Open rrlove opened 6 years ago

rrlove commented 6 years ago

I’ve run transIndel on a BAM containing a single specimen as a test before I expand to my whole cohort, with this command:

python ~/local/bin/transIndel/transIndel_build_DNA.py \ -i ${sample_path}/${sample}.bam \ -o ${sample_path}/${sample}_tIbuild.bam

python ~/local/bin/transIndel/transIndel_call.py \ -i ${sample_path}/${sample}_tIbuild.bam \ -o ${sample_path}/${sample}_tI \ -c 1 \ -f 0.5 \ -l 1

Every variant in the resulting VCF is reported as homozygous, which is not realistic and does not match the results of other variant callers. In addition, about 142 sites out of 440,000 total are reported twice with different variants. For example:

2R      1931135 .       T       TC      .       .       NS=1;AO=7;DP=14;AB=0.5;SVLEN=1;SVTYPE=INS;SVMETHOD=transIndel_ALN;CHR2=2R;END=1931136   GT     1/1

2R      1931135 .       .       \<DEL>   .     .       NS=1;AO=7;DP=14;AB=0.5;SVLEN=1;SVTYPE=DEL;SVMETHOD=transIndel_ALN;CHR2=2R;END=1931136   GT     1/1

I chose -f 0.5 given that in a single sample, variants are present at frequencies of 0, 0.5, or 1. However, I also tried -f 0.4 in case there was some stochasticity in the calculation of allele frequency, with similar results but more (about 5,500 out of 620,000) duplicate sites.

Is there a way to recover the lost heterozygotes and address the duplicate sites?

Also, is there an option to report the ref and alt alleles for deletions, rather than an empty field and \<DEL> respectively?

Thanks.

cauyrd commented 6 years ago

At current stage, please discard the GT field in the VCF file, there is no genotyping stage in transindel. So I just set it to 1/1 no matter what it really is. You may determine the genotype by reported AO/DP/AB. In current version, we only report the ref and alt alleles for insertion. The genotyping and report deletion ref and alt alleles will be added in the next version.

For the duplication sites, could you send me a VCF file? I can double check.

On Thu, May 31, 2018 at 9:51 AM, rrlove notifications@github.com wrote:

I’ve run transIndel on a BAM containing a single specimen as a test before I expand to my whole cohort, with this command:

python ~/local/bin/transIndel/transIndel_build_DNA.py -i ${sample_path}/${sample}.bam -o ${sample_path}/${sample}_tIbuild.bam

python ~/local/bin/transIndel/transIndel_call.py -i ${sample_path}/${sample}_tIbuild.bam -o ${sample_path}/${sample}_tI -c 1 -f 0.5 -l 1

Every variant in the resulting VCF is reported as homozygous, which is not realistic and does not match the results of other variant callers. In addition, about 142 sites out of 440,000 total are reported twice with different variants. For example:

2R 1931135 . T TC . . NS=1;AO=7;DP=14;AB=0.5;SVLEN=1;SVTYPE=INS;SVMETHOD= transIndel_ALN;CHR2=2R;END=1931136 GT 1/1

2R 1931135 . . . . NS=1;AO=7;DP=14;AB=0.5;SVLEN=1;SVTYPE=DEL;SVMETHOD= transIndel_ALN;CHR2=2R;END=1931136 GT 1/1

I chose -f 0.5 given that in a single sample, variants are present at frequencies of 0, 0.5, or 1. However, I also tried -f 0.4 in case there was some stochasticity in the calculation of allele frequency, with similar results but more (about 5,500 out of 620,000) duplicate sites.

Is there a way to recover the lost heterozygotes and address the duplicate sites?

Also, is there an option to report the ref and alt alleles for deletions, rather than an empty field and respectively?

Thanks.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cauyrd/transIndel/issues/2, or mute the thread https://github.com/notifications/unsubscribe-auth/ADMD4wjLsCwKzBGOPINDFTPf5gareOUBks5t4ANcgaJpZM4UVMJq .

rrlove commented 6 years ago

Attached is a partial VCF containing some of the duplicated sites.

transIndel_test_output.vcf.txt