franciscozorrilla / metaGEM

:gem: An easy-to-use workflow for generating context specific genome-scale metabolic models and predicting metabolic interactions within microbial communities directly from metagenomic data
https://franciscozorrilla.github.io/metaGEM/
MIT License
188 stars 40 forks source link

abundance | samtools view: failed to add PG line to the header #164

Open kunaljaani opened 2 months ago

kunaljaani commented 2 months ago

Hi! Francisco While running the -t abundance I am running into a problem with the sam to bam conversation due to a duplicate header. When I checked the grep "NODE_1753_length_683_cov_1.602310" FDSGM6.sam -c I found more than one occurrence.

Converting SAM to BAM with samtools view ... [W::sam_hdr_create] Duplicated sequence "NODE_1753_length_683_cov_1.602310" in file "FDSGM6.sam" [E::sam_hrecs_update_hashes] Duplicate entry "NODE_1753_length_683_cov_1.602310" in sam header samtools view: failed to add PG line to the header

I am also surprised to see that 2 of 35 samples ran fine. and if there was some issue it would have created problems for all the samples. So I am not able to figure out the issue. Could you please suggest a solution?

Thank you Kunal

franciscozorrilla commented 2 months ago

Hi Kunal, thanks again for reporting this issue and apologies for my delay in responding.

I believe that this problem occurs due to the fact that contig headers are not guaranteed to be unique across MAGs. It is unlikely, but evidently it is possible that two different contigs are assembled with the same node_length_coverage values across different genomes from the same sample, in which case you will see those errors you have reported. I will have to think about the best way to deal with this in the future, but for now you can simply add a unique MAG ID at the start of each contig header when concatenating MAGs to avoid this issue.

https://github.com/franciscozorrilla/metaGEM/blob/57cfd60485da82438faadcf75edc8623ac822196/workflow/Snakefile#L1066-L1067

Instead of using cat, use a loop to append MAG ID at the start of each contig e.g. assuming your mags for a given sample are in subfolder mags with .fa extension:

ls mags|grep ".fa"|while read genome;do paste mags/$genome| sed "s/>/>${genome%.*}_/g";done > sample.fa

Best wishes, Francisco

kunaljaani commented 2 months ago

Hi Francisco,

Thank you very much for your reply and the detailed explanation. Thanks a lot Kunal