fasterius / VarClust

A Python package for clustering of single nucleotide variants from high-throughput sequencing data.

varclust_create_profiles : KeyError: #4

Closed mikal16 closed 1 year ago

mikal16 commented 1 year ago

Hi there,

I'm trying to use varclust_create_profiles, and I have followed the instructions to make sure that my filename matches my sample name (minus the .vcf portion). However, I am still getting an error indicating that the files are not being found properly.

This is the error I am seeing (attached) and I have also included the vcf file (saved as txt because git doesn't allow me to attach .vcf files).

Any insight into what is going on here would be greatly appreciated.

Best,

Mikal

5_52_21_S1135_L003Aligned.sortedByCoord.out.bam.final.txt

Screen Shot 2023-06-06 at 1 15 04 AM

fasterius commented 1 year ago

Hi, Mikal!

Sorry to hear you're having issues with VarClust. Since you're getting a KeyError, my first thought is that it has something to do with sample and file naming. I know you wrote that you checked that they are the same, but could you please show the header line with the sample name(s) in the VCF file?
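For reference, a minimal sketch of how the sample names could be pulled from a VCF header, assuming an uncompressed, tab-delimited VCF. The function name `vcf_sample_names` is hypothetical and is not part of VarClust.

```python
def vcf_sample_names(lines):
    """Return the sample names listed on the #CHROM header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # The first nine columns are fixed (CHROM through FORMAT);
            # every column after that is a sample name.
            return columns[9:]
    return []


# Hypothetical two-line header used purely for illustration.
example_header = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n",
]
print(vcf_sample_names(example_header))
```

With a real file, the same function can be called on an open file handle, for example `with open(path) as handle: vcf_sample_names(handle)`, and the result compared against the filename without its .vcf extension.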

mikal16 commented 1 year ago

Hey there!

Thanks so much for the quick reply! And yes, I completely agree, I think something in the naming is throwing the error, so I'm sending a screenshot of my file here to show the naming. Essentially, what I made sure to do was change the sample name in row 29 to match the folder name (of course minus the .vcf part). Is this what I'm supposed to do?

Screen Shot 2023-06-07 at 10 55 35 AM

Thanks again! Mikal

fasterius commented 1 year ago

Hmm, yes, that looks right. Would you mind sending me a version of that VCF? Maybe just the header plus a couple of hundred lines, so I can test it out for myself?
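One way to produce such a shortened file is sketched below, assuming an uncompressed VCF in which every header line starts with `#`. The helper name `vcf_head` and the default of 200 records are illustrative choices only.

```python
def vcf_head(lines, n_records=200):
    """Yield every header line plus the first n_records data lines."""
    kept = 0
    for line in lines:
        if line.startswith("#"):
            yield line  # header lines are always kept
        elif kept < n_records:
            yield line
            kept += 1
        else:
            break  # enough records collected; stop reading


# Hypothetical miniature VCF used purely for illustration.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0\n",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0\n",
    "chr1\t300\t.\tG\tA\t50\tPASS\t.\tGT\t0\n",
]
shortened = list(vcf_head(example_vcf, n_records=2))
print(len(shortened))
```

For a real file, the yielded lines can be written to a new file with `writelines`.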

mikal16 commented 1 year ago

Yes, for sure. I'm going to save it as a txt file, since it won't let me attach a VCF, and send it over here. Let me know if you also need me to email you the VCF file directly.

Thanks a ton for your help,

Mikal

I've attached below a complete version and a shortened version (attachments 1 and 2 respectively).

5_52_21_S1135_L003Aligned.sortedByCoord.out.bam.final.txt

5_52_21_S1135_L003Aligned.sortedByCoord.out.bam.final.txt

fasterius commented 1 year ago

Okay, after checking the VCF files I can see that they are malformed. I can actually see this in your screenshot as well; I just didn't spot it before. The FORMAT column only specifies GT, while you'd normally have much more information (such as DP, AD, etc.), and your sample column is just zeroes.

You're getting the error because VarClust is trying to find information for your particular sample but finds nothing, since there's nothing there to find in your VCF.
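A quick way to spot this kind of problem before running any downstream tool is to list which keys appear in the FORMAT column. The following is a minimal sketch, assuming an uncompressed, tab-delimited VCF. The function name `vcf_format_keys` is hypothetical, and DP and AD are simply the examples named above, not a confirmed list of the fields VarClust requires.

```python
def vcf_format_keys(lines):
    """Collect every key used in the FORMAT column (9th column) of a VCF."""
    keys = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) > 8:
            # FORMAT keys are colon-separated, e.g. "GT:AD:DP".
            keys.update(columns[8].split(":"))
    return keys


# Hypothetical record resembling the situation described above: GT only.
records = ["chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0\n"]

# DP and AD are assumptions taken from the examples in this thread.
missing = {"DP", "AD"} - vcf_format_keys(records)
print(sorted(missing))
```

If `missing` is non-empty, the per-sample depth information is absent from the file, which matches the symptom described here.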

It seems that whatever you did to produce this VCF file went wrong, and I'd suggest you re-run those steps to get a properly formatted VCF. You can find more information about how a VCF file should be formatted here: https://samtools.github.io/hts-specs/VCFv4.2.pdf. You can also find an example VCF file in my other SNV-related R package, seqCAT: https://github.com/fasterius/seqCAT/blob/master/inst/extdata/test.unannotated.vcf.gz

I'm closing this issue as it doesn't have anything to do with VarClust, but please feel free to re-open if you still get issues after having fixed your VCF.