DRL / blobtools

Modular command-line solution for visualisation, quality control and taxonomic partitioning of genome datasets
GNU General Public License v3.0
184 stars 44 forks source link

Blobtools fail to calculate GC, coverage #126

Closed margaretc-ho closed 1 year ago

margaretc-ho commented 1 year ago

Hi all,

I have tried to troubleshoot this on my own to no avail, so I hope I can get your help on this. Blobtools was installed using git and it successfully runs the example data and generates good blob results for that. However, I am having trouble getting blobtools to work on my data. It runs fine without any error fail but it does not calculate GC or coverage from my contig fasta and PacBio CCS aligned bam files.

At first I thought my contig header names had issues (they first had pipe characters which I know SAMTools has issues with, so I changed to underscore and reran, then I thought maybe contig names were just too long so I changed the pipe characters to space, which generates short contig names similar to that in the example fasta -- since map2cov separates on whitespace) but I am still getting incorrect output. My contig names do not seem to be an issue.

This is how the output blobplots look blobplot blobplot read_cov cov0

This is part of the table.txt generated by blobtools image

This is also the stats.txt generated by blobtools image

As you can see, it does not calculate contig GC (which should just be from the FASTA, right?) and for whatever reason the cov0_read_map_p is calcuated as 0 even though it seems that it is able to recognize mapping reads from the bam.

I have checked that the sam/bam is intact with samtools quickcheck I have also run map2cov separately and it seems to recognize the bam just fine. Using blobtools create with this coverage file -c yields the same text files and blobplots lacking GC and proper calculation of coverage. These are the first few lines of the output of map2cov (.cov file) which looks normal to me image So I don't know why the calculation of cov0_read_map_p is 0 or why the % mapped reads is not showing up on the plot either.

I am stumped. Please help! This sequencing data is that of a protist and its endosymbiont(s) and we are trying to computationally isolate them. Thank you

DRL commented 1 year ago

Hi margaretc-ho,

I would guess something is wonky with the FASTA file. And maybe the lack of coverage comes from renaming the FASTA headers which then does not match the BAMs ... but I could be wrong...

If the problem persists, post how you ran the create command and the first few lines of the FASTA and BAM and maybe that way we can figure it out...

cheers,

dom

margaretc-ho commented 1 year ago

Hi Dom,

I have been checking everything and it is still not working. I am getting the same plots as above.

1) the contig fasta headers are as short as they can be ie. scaffold1, scaffold2, etc. 2) rerun minimap of the pacbio reads to the contigs.fa (exact same fasta as in (1) 3) rerun megablast on the contigs.fa (exact fasta as in (1)) The names of all the sequences look consistent, they are all scaffold1, scaffold2, etc.

This is the blobtools create command I am using and subsequent commands. GC coverage and read coverage are not calculated

module load samtools
blobtools nodesdb --nodes /data/homc/Utils/blobtools/data/nodes.dmp --names /data/homc/Utils/blobtools/data/names.dmp    
blobtools create -i t.contigs.k32.w100.tigmint-ntLink-arks.longstitch-scaffolds.filtered.short.fa -b Tminimap_PacBioCCS_to_PacBioCCS_5kbupLSw100k32.bam -t TcasCanuPacBioCCS_5kbupLSw100k32_filt.vs.nt.cul5.1e-25.np.megablast.out -o TCanuPacBioCCS_5kbupLSw100k32_filt
blobtools view -i TCanuPacBioCCS_5kbupLSw100k32_filt.blobDB.json
blobtools plot -i TCanuPacBioCCS_5kbupLSw100k32_filt.blobDB.json

What should I do?

margaretc-ho commented 1 year ago

I am happy to send over the full files for troubleshooting. Here I provide the first 1000 lines of the contigs, the megablast file, and the sam file and also the last 1000 lines of the sam file. I can also post the bam and bai file generated from the sam that I used with blobtools create, but figured the sam file might be more informative (given than the bam is a binary). I gave them all .txt extensions so I could upload to github.

head -n 1000 t.contigs.k32.w100.tigmint-ntLink-arks.longstitch-scaffolds.filtered.short.fa  > sample/contigs.fa.txt
head -n 1000 TCanuPacBioCCS_5kbupLSw100k32_filt.vs.nt.cul5.1e-25.np.megablast.out > sample/megablast.out.txt

head -n 1000 T_minimap_PacBioCCS_to_PacBioCCS_5kbupLSw100k32.sam > sample/head.sam.txt
tail -n 1000 T_minimap_PacBioCCS_to_PacBioCCS_5kbupLSw100k32.sam > sample/tail.sam.txt

contigs.fa.txt megablast.out.txt head.sam.txt tail.sam.txt

DRL commented 1 year ago

Hey,

This is very strange... the files seem fine as far as i can tell.

I suspect there might be something wrong with the blobtools installation. It is weird though that the example data work ...

Which version of python are you using?

Could you also install the dependencies using the conda commands? (Option A in in Installing dependencies)

cheers,

dom

margaretc-ho commented 1 year ago

The python on the HPC I am using is 2.7.15.

The installation instructions I followed were to do: git clone https://github.com/DRL/blobtools.git And install dependencies python setup.py install --user

I will try it in a conda environment. Is there a particular version of python I need to specify? Thanks

margaretc-ho commented 1 year ago

This time I tried it with conda installation I ran

git clone https://github.com/DRL/blobtools.git
conda create -n blobtools
conda activate blobtools
conda install -c anaconda matplotlib docopt tqdm wget pyyaml git
conda install -c bioconda pysam --update-deps

Python in this case is 3.10.8

Again, it works on sample data but not on my data :( Any ideas?

DRL commented 1 year ago

It will only work with python3 ... and always better to use conda for the dependencies.

Are you sure you are using the same blobtools installation for the sample data and your data?

You might have multiple installations of blobtools on the system.

Can you list the commands you use for sample data and your data?

margaretc-ho commented 1 year ago

Dom, you were 100% right. I did not realize that there was another copy of blobtools hiding in my local user bin folder and that was in my path, and I was calling that one which was DIFFERENT than one I ran on sample data bc I cd - ed into the correct install path (either both cases, pip and conda) and ran ./blobtools to access the example data. So in both install cases it was working fine, but there was a mystery blobtools (maybe an initial bad install that wasn't removed properly) that I mistakenly called bc it was the one in path.

IT WORKS and the plots are beautiful (our main organism is a lovely cluster of blobs) and we have several endosymbiont candidates now. Thank you for your patience in helping me work this out!