Illumina / ExpansionHunter

A tool for estimating repeat sizes

Building variant catalog from CRAM file error #199

Open roel1289 opened 2 days ago

roel1289 commented 2 days ago

Hello,

I am currently working on developing a variant catalog that contains every location of the (GGGAGA)* repeat across the whole human genome (a couple thousand locations).

To make this I am using CRAM files (aligned to hg38), and I am keeping track of the location of each GGGAGA repeat using string searches. I am then using these locations to produce a variant catalog. Here is an example of a shortened variant catalog I have made:

[
    {
        "LocusId": "SVA",
        "LocusStructure": "(GGGAGA)*",
        "ReferenceRegion": [
            "chr2:32916460-32916478",
            "chr12:48501530-48501548"
        ],
        "VariantType": "Repeat"
    }
]

However, I keep getting this error: 2024-11-20T13:41:25,[Error loading locus SVA: Locus SVA must specify reference regions for 1 variants]
How can I make a variant catalog like this without knowing the reference region from the reference genome?

Thanks! Ross

andreasssh commented 2 days ago

Hi there,

You should create a new entry for each locus, e.g.:

[
    {
        "LocusId": "locus_id_for_this_region",
        "LocusStructure": "(GGGAGA)*",
        "ReferenceRegion": "chr2:32916460-32916478",
        "VariantType": "Repeat"
    },
    {
        "LocusId": "and_locus_id_for_this_region",
        "LocusStructure": "(GGGAGA)*",
        "ReferenceRegion": "chr12:48501530-48501548",
        "VariantType": "Repeat"
    }
]
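
With a couple thousand locations, writing these entries by hand is impractical, so here is a minimal Python sketch that generates one entry per region. The function name `build_catalog`, the `SVA` prefix, and the ID naming scheme are my own assumptions for illustration; only the four JSON keys come from the catalog format shown above.

```python
import json


def build_catalog(regions, motif="GGGAGA", prefix="SVA"):
    """Return one catalog entry per region, each with its own LocusId."""
    catalog = []
    for region in regions:
        # Hypothetical naming scheme: derive a unique ID from the coordinates.
        locus_id = f"{prefix}_{region.replace(':', '_').replace('-', '_')}"
        catalog.append({
            "LocusId": locus_id,
            "LocusStructure": f"({motif})*",
            "ReferenceRegion": region,  # a single string, not a list
            "VariantType": "Repeat",
        })
    return catalog


regions = ["chr2:32916460-32916478", "chr12:48501530-48501548"]
catalog = build_catalog(regions)
print(json.dumps(catalog, indent=4))
```

Each locus gets exactly one reference region because its structure contains exactly one repeat, which is what the "must specify reference regions for 1 variants" message is complaining about.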

roel1289 commented 1 day ago

Thank you for the help. Is there any way to get around this error: [Error loading locus SVA: Flanks can contain at most 5 characters N but found 985 Ns]?

It seems that some people have made changes to get rid of this warning (https://github.com/bw2/ExpansionHunter). What is the best route to fix this error?

Thanks!

andreasssh commented 3 hours ago

Yep, the bw2 version and my version (https://gitlab.com/andreassh/ExpansionHunter) can both skip the loci that trigger this error and continue the analysis instead of terminating, as the original version does. The error occurs when a locus is close to a chromosome end or lies in an unsequenced part of the reference genome, so its flanks consist largely of N bases.
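
If you would rather stay on the original version, another option is to drop such loci from the catalog before running it. Below is a rough Python sketch of that filter, not something taken from either fork. The limit of 5 comes from the error message; the flank length of 1000, the 0-based half-open coordinates, and counting both flanks together are assumptions on my part, and the helper names are hypothetical. It expects the chromosome sequence as a plain string, however you choose to load it.

```python
import re

MAX_FLANK_NS = 5      # limit quoted in the error message
FLANK_LENGTH = 1000   # assumption: bases inspected on each side of the locus


def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end)."""
    match = re.fullmatch(r"([^:]+):(\d+)-(\d+)", region)
    if match is None:
        raise ValueError(f"Malformed region: {region}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def flank_n_count(chrom_seq, start, end, flank_length=FLANK_LENGTH):
    """Count N bases in both flanks of [start, end), clipped to the sequence."""
    left = chrom_seq[max(0, start - flank_length):start]
    right = chrom_seq[end:end + flank_length]
    return (left + right).upper().count("N")


def keep_locus(chrom_seq, region, max_ns=MAX_FLANK_NS, flank_length=FLANK_LENGTH):
    """Return True if the flanks hold at most max_ns N bases."""
    _, start, end = parse_region(region)
    return flank_n_count(chrom_seq, start, end, flank_length) <= max_ns


# Toy sequence: 30 Ns, 40 normal bases, an 18 bp repeat, 40 normal bases.
toy_seq = "N" * 30 + "ACGT" * 10 + "GGGAGA" * 3 + "ACGT" * 10
print(keep_locus(toy_seq, "chr1:70-88", flank_length=20))  # short flanks miss the Ns
print(keep_locus(toy_seq, "chr1:70-88", flank_length=50))  # longer flanks reach them
```

In the toy example the same locus passes with 20 bp flanks and fails with 50 bp flanks, which mirrors what happens to real loci sitting next to an unsequenced stretch.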