Clinical-Genomics / stranger

Tool to annotate outfiles from ExpansionHunter and TRGT with the pathologic implications of the repeat
MIT License
30 stars 6 forks source link

Different output for repeats with missing info #76

Closed Stikus closed 1 week ago

Stikus commented 1 week ago

Hello, looks like Stranger output has changed between 0.8.1 and 0.9.1 versions (0.9.0 version have errors, fixed in 0.9.1 - we cannot get result from it easily);

Log from 0.8.1 version:

Command: '/soft/Stranger-0.8.1/local/bin/stranger --repeats-file /input/tests/input_data/variants.json /outputs/0.8.1_2/test.expansion-hunter.vcf > /outputs/0.8.1_2/test.expansion-hunter.stranger.vcf'.
2024-09-25 14:56:49 14c41598f761 stranger.cli[2642] INFO Running stranger version 0.8.1
2024-09-25 14:56:49 14c41598f761 stranger.cli[2642] INFO Parsing repeats file /input/tests/input_data/variants.json
2024-09-25 14:56:49 14c41598f761 stranger.utils[2642] WARNING Repeat number 1 (SNV_AND_STR) has multiple 'ReferenceRegion' but no 'PathologicRegion'. Skipping
..
2024-09-25 14:56:49 14c41598f761 stranger.cli[2642] INFO Parsing variants from /outputs/0.8.1_2/test.expansion-hunter.vcf
2024-09-25 14:56:49 14c41598f761 stranger.utils[2642] WARNING No info for repeat id SNV_AND_STR_chr1:2000-2001
2024-09-25 14:56:49 14c41598f761 stranger.utils[2642] WARNING No info for repeat id SNV_AND_STR_chr1:2005-2008

Log from 0.9.1 version:

Command: '/soft/Stranger-0.9.1/local/bin/stranger --repeats-file /input/tests/input_data/variants.json /outputs/0.9.1/test.expansion-hunter.vcf > /outputs/0.9.1/test.expansion-hunter.stranger.vcf'.
2024-09-25 14:51:49 14c41598f761 stranger.cli[1292] INFO Running stranger version 0.9.1
2024-09-25 14:51:49 14c41598f761 stranger.cli[1292] INFO Parsing repeats file /input/tests/input_data/variants.json
2024-09-25 14:51:49 14c41598f761 stranger.utils[1292] WARNING Repeat number 1 (SNV_AND_STR) has multiple 'ReferenceRegion' but no 'PathologicRegion'. Skipping
..
2024-09-25 14:51:49 14c41598f761 stranger.utils[1292] WARNING No info for repeat id SNV_AND_STR_chr1:2000-2001
2024-09-25 14:51:49 14c41598f761 stranger.utils[1292] WARNING No info for repeat id SNV_AND_STR_chr1:2005-2008

Logs look same, but outputs not:

Input file:

##fileformat=VCFv4.1
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant">
##INFO=<ID=REF,Number=1,Type=Integer,Description="Reference copy number">
##INFO=<ID=REPID,Number=1,Type=String,Description="Repeat identifier as specified in the variant catalog">
##INFO=<ID=RL,Number=1,Type=Integer,Description="Reference length in bp">
##INFO=<ID=RU,Number=1,Type=String,Description="Repeat unit in the reference orientation">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=VARID,Number=1,Type=String,Description="Variant identifier as specified in the variant catalog">
##FILTER=<ID=LowDepth,Description="The overall locus depth is below 10x or number of reads spanning one or both breakends is below 5">
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=ADFL,Number=1,Type=String,Description="Number of flanking reads consistent with the allele">
##FORMAT=<ID=ADIR,Number=1,Type=String,Description="Number of in-repeat reads consistent with the allele">
##FORMAT=<ID=ADSP,Number=1,Type=String,Description="Number of spanning reads consistent with the allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=LC,Number=1,Type=Float,Description="Locus coverage">
##FORMAT=<ID=REPCI,Number=1,Type=String,Description="Confidence interval for REPCN">
##FORMAT=<ID=REPCN,Number=1,Type=String,Description="Number of repeat units spanned by the allele">
##FORMAT=<ID=SO,Number=1,Type=String,Description="Type of reads that support the allele; can be SPANNING, FLANKING, or INREPEAT meaning that the reads span, flank, or are fully contained in the repeat">
##ALT=<ID=STR10,Description="Allele comprised of 10 repeat units">
##ALT=<ID=STR2,Description="Allele comprised of 2 repeat units">
##ALT=<ID=STR4,Description="Allele comprised of 4 repeat units">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  test
chr1    2000    .   T   <STR4>  .   PASS    END=2001;REF=1;RL=1;RU=R;VARID=SNV_AND_STR_chr1:2000-2001;REPID=SNV_AND_STR_chr1:2000-2001  GT:SO:REPCN:REPCI:ADSP:ADFL:ADIR:LC 0/1:SPANNING/SPANNING:1/4:1-1/4-4:50/16:0/4:0/0:99.729730
chr1    2005    .   C   <STR2>,<STR10>  .   PASS    END=2008;REF=1;RL=3;RU=CAG;VARID=SNV_AND_STR_chr1:2005-2008;REPID=SNV_AND_STR_chr1:2005-2008    GT:SO:REPCN:REPCI:ADSP:ADFL:ADIR:LC 1/2:SPANNING/SPANNING:2/10:2-2/10-10:47/33:5/21:0/0:99.729730

Output 0.8.1:

##fileformat=VCFv4.1
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant">
##INFO=<ID=REF,Number=1,Type=Integer,Description="Reference copy number">
##INFO=<ID=REPID,Number=1,Type=String,Description="Repeat identifier as specified in the variant catalog">
##INFO=<ID=RL,Number=1,Type=Integer,Description="Reference length in bp">
##INFO=<ID=RU,Number=1,Type=String,Description="Repeat unit in the reference orientation">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=VARID,Number=1,Type=String,Description="Variant identifier as specified in the variant catalog">
##FILTER=<ID=LowDepth,Description="The overall locus depth is below 10x or number of reads spanning one or both breakends is below 5">
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=ADFL,Number=1,Type=String,Description="Number of flanking reads consistent with the allele">
##FORMAT=<ID=ADIR,Number=1,Type=String,Description="Number of in-repeat reads consistent with the allele">
##FORMAT=<ID=ADSP,Number=1,Type=String,Description="Number of spanning reads consistent with the allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=LC,Number=1,Type=Float,Description="Locus coverage">
##FORMAT=<ID=REPCI,Number=1,Type=String,Description="Confidence interval for REPCN">
##FORMAT=<ID=REPCN,Number=1,Type=String,Description="Number of repeat units spanned by the allele">
##FORMAT=<ID=SO,Number=1,Type=String,Description="Type of reads that support the allele; can be SPANNING, FLANKING, or INREPEAT meaning that the reads span, flank, or are fully contained in the repeat">
##ALT=<ID=STR10,Description="Allele comprised of 10 repeat units">
##ALT=<ID=STR2,Description="Allele comprised of 2 repeat units">
##ALT=<ID=STR4,Description="Allele comprised of 4 repeat units">
##INFO=<ID=STR_STATUS,Number=A,Type=String,Description="Repeat expansion status. Alternatives in [normal, pre_mutation, full_mutation]">
##INFO=<ID=STR_NORMAL_MAX,Number=1,Type=Integer,Description="Max number of repeats allowed to call as normal">
##INFO=<ID=STR_PATHOLOGIC_MIN,Number=1,Type=Integer,Description="Min number of repeats required to call as pathologic">
##INFO=<ID=SourceDisplay,Number=1,Type=String,Description="Source for variant definition, display">
##INFO=<ID=Source,Number=1,Type=String,Description="Source collection for variant definition">
##INFO=<ID=SourceId,Number=1,Type=String,Description="Source id for variant definition">
##INFO=<ID=SweGenMean,Number=1,Type=Float,Description="Average number of repeat unit copies in population">
##INFO=<ID=SweGenStd,Number=1,Type=Float,Description="Standard deviation of number of repeat unit copies in population">
##INFO=<ID=DisplayRU,Number=1,Type=String,Description="Display repeat unit familiar to clinician">
##INFO=<ID=InheritanceMode,Number=1,Type=String,Description="Main mode of inheritance for disorder">
##INFO=<ID=HGNCId,Number=1,Type=Integer,Description="HGNC gene id for associated disease gene">
##INFO=<ID=RankScore,Number=1,Type=String,Description="RankScore for variant in this family as family(str):score(int)">
##INFO=<ID=Disease,Number=1,Type=String,Description="Associated disorder">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  test
chr1    2000    .   T   <STR4>  .   PASS    END=2001;REF=1;RL=1;RU=R;VARID=SNV_AND_STR_chr1:2000-2001;REPID=SNV_AND_STR_chr1:2000-2001  GT:SO:REPCN:REPCI:ADSP:ADFL:ADIR:LC 0/1:SPANNING/SPANNING:1/4:1-1/4-4:50/16:0/4:0/0:99.729730
chr1    2005    .   C   <STR2>,<STR10>  .   PASS    END=2008;REF=1;RL=3;RU=CAG;VARID=SNV_AND_STR_chr1:2005-2008;REPID=SNV_AND_STR_chr1:2005-2008    GT:SO:REPCN:REPCI:ADSP:ADFL:ADIR:LC 1/2:SPANNING/SPANNING:2/10:2-2/10-10:47/33:5/21:0/0:99.729730

Output 0.9.1:

##fileformat=VCFv4.1
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant">
##INFO=<ID=REF,Number=1,Type=Integer,Description="Reference copy number">
##INFO=<ID=REPID,Number=1,Type=String,Description="Repeat identifier as specified in the variant catalog">
##INFO=<ID=RL,Number=1,Type=Integer,Description="Reference length in bp">
##INFO=<ID=RU,Number=1,Type=String,Description="Repeat unit in the reference orientation">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=VARID,Number=1,Type=String,Description="Variant identifier as specified in the variant catalog">
##FILTER=<ID=LowDepth,Description="The overall locus depth is below 10x or number of reads spanning one or both breakends is below 5">
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=ADFL,Number=1,Type=String,Description="Number of flanking reads consistent with the allele">
##FORMAT=<ID=ADIR,Number=1,Type=String,Description="Number of in-repeat reads consistent with the allele">
##FORMAT=<ID=ADSP,Number=1,Type=String,Description="Number of spanning reads consistent with the allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=LC,Number=1,Type=Float,Description="Locus coverage">
##FORMAT=<ID=REPCI,Number=1,Type=String,Description="Confidence interval for REPCN">
##FORMAT=<ID=REPCN,Number=1,Type=String,Description="Number of repeat units spanned by the allele">
##FORMAT=<ID=SO,Number=1,Type=String,Description="Type of reads that support the allele; can be SPANNING, FLANKING, or INREPEAT meaning that the reads span, flank, or are fully contained in the repeat">
##ALT=<ID=STR10,Description="Allele comprised of 10 repeat units">
##ALT=<ID=STR2,Description="Allele comprised of 2 repeat units">
##ALT=<ID=STR4,Description="Allele comprised of 4 repeat units">
##INFO=<ID=STR_STATUS,Number=A,Type=String,Description="Repeat expansion status. Alternatives in [normal, pre_mutation, full_mutation]">
##INFO=<ID=STR_NORMAL_MAX,Number=1,Type=Integer,Description="Max number of repeats allowed to call as normal">
##INFO=<ID=STR_PATHOLOGIC_MIN,Number=1,Type=Integer,Description="Min number of repeats required to call as pathologic">
##INFO=<ID=SourceDisplay,Number=1,Type=String,Description="Source for variant definition, display">
##INFO=<ID=Source,Number=1,Type=String,Description="Source collection for variant definition">
##INFO=<ID=SourceId,Number=1,Type=String,Description="Source id for variant definition">
##INFO=<ID=SweGenMean,Number=1,Type=Float,Description="Average number of repeat unit copies in population">
##INFO=<ID=SweGenStd,Number=1,Type=Float,Description="Standard deviation of number of repeat unit copies in population">
##INFO=<ID=DisplayRU,Number=1,Type=String,Description="Display repeat unit familiar to clinician">
##INFO=<ID=InheritanceMode,Number=1,Type=String,Description="Main mode of inheritance for disorder">
##INFO=<ID=HGNCId,Number=1,Type=Integer,Description="HGNC gene id for associated disease gene">
##INFO=<ID=RankScore,Number=1,Type=String,Description="RankScore for variant in this family as family(str):score(int)">
##INFO=<ID=Disease,Number=1,Type=String,Description="Associated disorder">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  test

Output 0.8.1 is same as input except from ##INFO= strings at the end of the header - annotation is unchanged. but output 0.9.1 has same header and lost all file content.

Is this intended? I'll try to find reason for this difference, but don't have time for full debug. Maybe it is due to get_repeat_id function introdution?

dnil commented 1 week ago

Hi there! Thank you for getting in touch! We did change up the repeat id parsing to accept TRGT style loci as well between those versions, so that is likely the issue here.

Just so I understand, what more exactly are you trying to do? It looks like you have a reference catalog file of your own; is it available somewhere? Are you intentionally specifying multiple regions but giving no pathological region?

Stikus commented 1 week ago

Oh, sorry, forgot to add variant catalog:

[
    {
        "LocusId": "SNV_AND_STR",
        "LocusStructure": "(R)*AATC(CAG)*",
        "ReferenceRegion": [
            "chr1:2000-2001",
            "chr1:2005-2008"
        ],
        "VariantType": [
            "Repeat",
            "Repeat"
        ],
        "NormalMax": "-",
        "Disease": "-",
        "PathologicMin": "-"
    }
]

This is ExpansionHunter test data from repo: https://github.com/Illumina/ExpansionHunter/blob/master/example/input/variants.json

https://github.com/Illumina/ExpansionHunter/tree/master/example/input - full input data

We have our combined tool, consists of ExpansionHunter, Stranger and REViewer - that's why we're using test data from EH as test data for all tool.

dnil commented 1 week ago

Thank you for the report! Expect the fix from current main in release 0.9.2.

Nice to see Stranger is in use. Where can one find your tool?

Stikus commented 1 week ago

Thanks for quick fix. Our tool is just a pipeline with ExpansionHunter, Stranger and REViewer in docker container and some bash script wrapper, nothing worth to be published :)