ERROR with test run (hg19.STRdecoys.sorted.fasta.sa missing from test dataset)

Oshlack / STRetch

Method for detecting STR expansions from short-read sequencing data

MIT License


ERROR with test run (hg19.STRdecoys.sorted.fasta.sa missing from test dataset) #67

Closed hchetia closed 2 years ago

hchetia commented 3 years ago

Hi, I am trying to test run STRetch using the dataset at https://ndownloader.figshare.com/articles/4762489?private_link=cc7347f4637d9a7fe22d and am running into the following error (please find attached). Basically, the tool looks for a file, "hg19.STRdecoys.sorted.fasta.sa", which is not part of the test dataset.

stretch_error.txt

hdashnow commented 3 years ago

Hi @hchetia, I've traced this back to what looks like a problem at Figshare's end. I've contacted them and will let you know when I have a fix.

hchetia commented 3 years ago

Do you recommend going ahead with my actual runs rather than waiting for the test dataset to become available?

hdashnow commented 3 years ago

The error was caused by missing reference genome files, so you will need them for a regular run as well.

You can download individual missing files from here in the meantime:

hg19 https://figshare.com/articles/dataset/STRetch_reference_data_-_hg19/4658701/1

hg38 https://figshare.com/articles/dataset/STRetch_reference_data_-_hg38/5844396
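Before starting a run, it can help to confirm which index files are actually present next to the reference FASTA. The following is a minimal sketch, not part of STRetch. It assumes the `.sa` file named in the error belongs to the standard set written by `bwa index` (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`) and that the pipeline expects all five; the `reference-data/` path is hypothetical.

```python
from pathlib import Path

# Suffixes written by `bwa index`. Assumption: the pipeline expects all of them
# to sit next to the reference FASTA.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(fasta_path):
    """Return the names of index files that do not exist next to the FASTA."""
    fasta = Path(fasta_path)
    return [
        fasta.name + suffix
        for suffix in BWA_INDEX_SUFFIXES
        if not Path(str(fasta) + suffix).exists()
    ]


if __name__ == "__main__":
    # Hypothetical location; point this at wherever the reference data lives.
    missing = missing_bwa_index_files("reference-data/hg19.STRdecoys.sorted.fasta")
    print("missing:", missing or "none")
```

Any file reported as missing would need to be downloaded individually from the links above.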

hdashnow commented 2 years ago

Figshare wasn't getting back to me about this error, so I've moved the data. Would you mind updating with git pull and running ./install.sh again? Then let me know if the test data works.

hchetia commented 2 years ago

Hi @hdashnow, I already ran my data using the files you shared above, and it worked fine for me. Thanks.

hdashnow commented 2 years ago

Great!

hchetia commented 2 years ago

Hi @hdashnow, thanks for STRetch. I love the concept of decoy chromosomes. Do you happen to have a reference hg38 FASTA file with the repeats introduced within the genes, along with the correspondingly updated annotations? Adding 2000 trinucleotide repeats would add 6000 residues to the downstream annotation coordinates, right? Any GTF or GFF3 format would work. This genome would be really helpful for visualizing the reads in IGV.
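The arithmetic in the question (2000 repeat units of 3 bases add 6000 bases) can be sketched as a coordinate shift. This is an illustrative example only, assuming 1-based inclusive GFF/GTF-style coordinates and a single insertion placed immediately after a given position; `shift_feature` is a hypothetical helper, not something STRetch provides.

```python
def shift_feature(start, end, insert_pos, insert_len):
    """Shift a 1-based inclusive feature after inserting insert_len bases
    immediately after position insert_pos on the same chromosome."""
    if start > insert_pos:
        # Feature lies entirely downstream: both ends move.
        return start + insert_len, end + insert_len
    if end > insert_pos:
        # Feature spans the insertion point: only the end moves.
        return start, end + insert_len
    # Feature lies entirely upstream: unchanged.
    return start, end


repeat_units, unit_len = 2000, 3
insert_len = repeat_units * unit_len  # 6000 bases

print(shift_feature(100, 200, 500, insert_len))  # (100, 200)
print(shift_feature(400, 600, 500, insert_len))  # (400, 6600)
print(shift_feature(501, 700, 500, insert_len))  # (6501, 6700)
```

Features on other chromosomes are unaffected, and multiple insertions on one chromosome would need their offsets accumulated in coordinate order.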

Regards, Hasna

hdashnow commented 2 years ago

I don't have anything like that. I do have some code for generating FASTA files with different STR alleles, which I used for simulating reads. You could potentially use similar logic to create an alternate reference: https://github.com/quinlan-lab/STRling/blob/master/sim/random_str_alleles.py

I think that visualizing in IGV will still be challenging because of the anchored reads. When you look at the reads aligned to the STRetch genome + decoy, these anchored reads should show up in a different colour because their mates align to a different chromosome (the decoy). This can help with visualization.
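As a rough illustration of the idea of building an alternate reference, the sketch below inserts extra repeat units into a sequence string. It is not taken from the linked script and makes no claim about how that script works; `insert_repeat` and the toy sequence are hypothetical, and a real workflow would read and write FASTA records rather than plain strings.

```python
def insert_repeat(sequence, insert_pos, unit, copies):
    """Return `sequence` with `copies` copies of `unit` inserted after the
    1-based position insert_pos (insert_pos=0 prepends the repeat)."""
    if not 0 <= insert_pos <= len(sequence):
        raise ValueError("insert_pos is outside the sequence")
    return sequence[:insert_pos] + unit * copies + sequence[insert_pos:]


ref = "ACGTACGT"                       # toy reference sequence
alt = insert_repeat(ref, 4, "CAG", 3)  # add three CAG units after base 4
print(alt)                             # ACGTCAGCAGCAGACGT
print(len(alt) - len(ref))             # 9
```

The length difference is the offset by which any annotation downstream of the insertion point on that chromosome would have to move.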

hchetia commented 2 years ago

Awesome. Thanks, will try and get back to you.