edgardomortiz / sam2consensus

Get the consensus sequences from SAM (and BAM) files, ignoring the reference
GNU General Public License v3.0
10 stars 4 forks source link

KeyError #1

Open steinbrl opened 2 years ago

steinbrl commented 2 years ago

Hi,

I tried to pull an consensus from an Minimap2 Alignment. sam2consensus stops with the following error message: lars@helmut:~/Desktop/VZV/Carina_09_2021/VZV_ORF14_cmv/assembly-test/Scaffold$ python /home/lars/Desktop/NGS/sam2consensus-master/sam2consensus.py -f N -p final -i scaffold_mapping.sam

Processing file scaffold_mapping.sam:

SAM header processed, 1 references found.

0 reads processed. Traceback (most recent call last): File "/home/lars/Desktop/NGS/sam2consensus-master/sam2consensus.py", line 430, in main() File "/home/lars/Desktop/NGS/sam2consensus-master/sam2consensus.py", line 212, in main sequences[refname][pos_ref][nuc] += 1 KeyError: '*'

The .sam file seemed to be correct. There are no issues while importing and visualize it in CLC or Geneious.

Best,

Lars

edgardomortiz commented 2 years ago

Hi Lars,

Would you mind sharing a few thousand lines of your SAM file?, I haven't tried SAMs from minimap yet. It might be something related to that. You can email me directly if you don't want to share your file here in GitHub.

Edgardo

steinbrl commented 2 years ago

Wow, very fast reply :D I can share the whole file, it is not that big. It is the product of an reference based scaffolding step. So, it contains orientated contigs. scaffold_mapping.sam.zip

edgardomortiz commented 2 years ago

Thanks, I will take a look and get back in the new few days (pretty busy with work too)

DenisTEMPE commented 2 years ago

Hi Edgardo, hi Lars, I'm having the same kind of error while processing a locus in chr7 of sam files from either tophat2 or STAR. Thank you in advance, Denis

Processing file locus_accepted_hits.sam:

SAM header processed, 0 references found.

Traceback (most recent call last): File "sam2consensus.py", line 436, in main() File "sam2consensus.py", line 223, in main sequences[refname][pos_ref][nuc] += 1 KeyError: 'chr7'

locus_accepted_hits.zip

edgardomortiz commented 2 years ago

Sorry , it took me longer than I expected to check on your files. @steinbrl sam2consensus needs a SAM with full header information, also the SAM you sent doesn't have any CIGAR strings, perhaps you could provide the command you used to create this SAM? @DenisTEMPE your case is easier to solve, assuming these are the first lines of the SAM you got, perhaps you are obtaining the file through samtools view which without additional options skips the header, just add -h to your samtools view command to include the header in the output.

Edgardo

DenisTEMPE commented 2 years ago

Hi Edgardo, In ran sam2consensus with headers in the sam file. The job runs endless after "Start processing..." and is finally aborted :

Start processing...

Segmentation fault (core dumped) locus.sam.txt

steinbrl commented 2 years ago

Hey Edgardo,

I used Minimap2 for mapping. I switched now to bcftools to pull the consensus. It is only to scaffold the draftgenome out of contigs, after the denovo assembly. I found an workaround to do so…

Thank you for your effort!

Best regards,

Lars

-- Lars Steinbrück Medizinische Hochschule Hannover Institut für Virologie OE 5230 AG Schulz/NGS/Genomics Geb. I6, Ebene 6, Raum 2060 Carl-Neuberg-Straße 1 30625 Hannover Tel.: ++49 (0)511 532 4313 Fax.: ++49 (0)511 532 8736

Von: Edgardo M. Ortiz @.> Gesendet: Mittwoch, 10. November 2021 09:14 An: edgardomortiz/sam2consensus @.> Cc: Steinbrück, Lars @.>; Mention @.> Betreff: Re: [edgardomortiz/sam2consensus] KeyError (#1)

Sorry , it took me longer than I expected to check on your files. @steinbrlhttps://github.com/steinbrl sam2consensus needs a SAM with full header information, also the SAM you sent doesn't have any CIGAR strings, perhaps you could provide the command you used to create this SAM? @DenisTEMPEhttps://github.com/DenisTEMPE your case is easier to solve, assuming these are the first lines of the SAM you got, perhaps you are obtaining the file through samtools view which without additional options skips the header, just add -h to your samtools view command to include the header in the output.

Edgardo

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/edgardomortiz/sam2consensus/issues/1#issuecomment-964880420, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AOAEBCFMMX22LRFOV2PAFMDULISTLANCNFSM5F7TCRNQ. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

edgardomortiz commented 2 years ago

Hi @DenisTEMPE ,

After a few tests I found the reason for the failure. My script creates empty containers for the reference sequences which get filled with nucleotides sums as the SAM is processed, these take around 10 bytes per nucleotide in the reference. So, given a reference as long as the one you have the script quickly fills up the RAM and get the process killed. My code is impractical for large references, it can only work well with references of at most 10Mbp, when I wrote the program I was working with small organellar genomes.

Perhaps I will attempt to rewrite the code for large references, but I can't promise when. Sorry, you will have to look for another solution.

@steinbrl thank for the info, that makes perfect sense for your case. If the consensus calculated by bcftools is reference-agnostic it could help @DenisTEMPE too.

Edgardo