EBI-Metagenomics / EukCC

Tool to estimate genome quality of microbial eukaryotes
GNU General Public License v3.0

'NoneType' object has no attribute 'query_alignment_length' when using merged bam file #32

Open mgabriell1 opened 2 years ago

mgabriell1 commented 2 years ago

Hi, first of all thanks for developing this tool! Unfortunately I'm having some issues merging eukaryotic bins using https://github.com/Finn-Lab/EukCC/blob/master/scripts/binlinks.py. I have multiple samples which I have assembled, created BAM files for, and binned. Now I'd like to check whether it is possible to merge some of the eukaryotic bins following this procedure (https://eukcc.readthedocs.io/en/latest/bin_merging.html#preparing-your-linked-reads). My doubt/question is that, since I have multiple BAM files, I should first merge them and afterwards carry out the procedure described. I tried that using samtools merge -o samples_all.sorted2.bam sample_*.sorted.bam -c, but when I run binlinks.py I get the following error:

Traceback (most recent call last):
  File "./binlinks.py", line 216, in <module>
    main()
  File "./binlinks.py", line 167, in main
    if keep_read(read, cm, args.within, min_ANI=args.ANI) and keep_read(
  File "./binlinks.py", line 27, in keep_read
    / float(read.query_alignment_length) 
    * 100
AttributeError: 'NoneType' object has no attribute 'query_alignment_length'

I've tried with a BAM file from every single sample and the script runs with no problems, but I guess that doing so wouldn't be optimal. Is that correct? Also, I would get n files with possibly different linkages...

I've also tried concatenating the samples' fastq files and then creating a single BAM file from those, but I still get the same error.

What could be the reason for this?

openpaul commented 2 years ago

Hello, the merging workflow is supposed to run on a per-sample basis, so ideally you would have only a single BAM file for each sample. I am not sure I understand: why do you have multiple BAM files?

But regardless, this seems like something that should work, and the script should handle it more gracefully. I will try to solve this in a revised script if I can reproduce the issue.

mgabriell1 commented 2 years ago

Sorry for not being clear. Let me explain again what I have done: First, I performed a co-assembly using reads from several samples. Then, I created BAM files for each sample to perform binning. Now, I'd like to check whether I could merge some bins, and since I saw in the docs that EukCC requires only a single BAM file, I thought that the best way to estimate the linkage was to use a BAM file in which all the samples are included. So I created one using the two methods I described above (merging the already created BAM files, or creating one from the concatenated sample reads), but both gave the error above.

On a partially related note, one page of the docs says that EukCC, when run in folder mode, will automatically refine the bins (https://eukcc.readthedocs.io/en/latest/quickstart.html#eukcc-on-a-folder-of-bins), while another says that merging bins requires estimating the linkages using binlinks.py (https://eukcc.readthedocs.io/en/latest/bin_merging.html). What is the difference?

openpaul commented 2 years ago

Thank you for elaborating. Yes, in that case merging the BAM files sounds like a good idea.

It seems that due to the merging you lose the query_alignment_length attribute.

To sidestep the issue you can concatenate the fastq files and produce a new BAM file by aligning all reads again. That would probably work, but I understand that merging the BAM files would be faster and easier for you.
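Roughly like this (file names are just examples; the reference should be the co-assembly you used for binning, indexed with bwa index beforehand):

# concatenate per-sample reads, then align everything in one go
cat sample_*_QC_R1.fq > all_R1.fq
cat sample_*_QC_R2.fq > all_R2.fq
bwa mem -t 16 coassembly.fa all_R1.fq all_R2.fq | samtools sort -o all_samples.sorted.bam -
samtools index all_samples.sorted.bam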

I will have to test samtools merge myself to see what happens there.

openpaul commented 2 years ago

And with regard to the documentation: thank you for pointing this out. I will make these sections clearer.

Linked reads are not required for bin merging, but they are recommended. The algorithm works with or without them, but it is much faster and much more confident with the link table.
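For example, something along these lines (check eukcc folder --help for the exact flags):

# merging without a link table: works, but slower and less confident
eukcc folder --out outfolder --threads 8 bins
# merging with a link table produced by binlinks.py
eukcc folder --out outfolder --threads 8 --links linktable.csv bins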

mgabriell1 commented 2 years ago

I've tried doing that, but I encountered the same error. I don't know if it was a mistake on my side, but I followed the same script I used to create the single BAMs, and those worked.

Ok, now the difference is perfectly clear!

openpaul commented 2 years ago

Oh I see. Are you able to share the BAM file with me for testing? That would make testing easy for me. Of course the data would be treated as confidential.

Otherwise I will create a workaround that lets you report the problem read and ignore the error.

openpaul commented 2 years ago

I created such a debug script here: https://github.com/Finn-Lab/EukCC/blob/issue32/scripts/binlinks.py. You can send me its output so I have the information needed to fix the issue.

It should run through and produce a usable file, but I did not test it, so it might also fail.

mgabriell1 commented 2 years ago

Which BAM? The one created from the single files, or the merged one? In any case, if you share your email I can send you a link to it.

openpaul commented 2 years ago

Send it to saary@ebi.ac.uk. Please send the one that has the issue, the merged one I assume.

openpaul commented 2 years ago

Hello,

thank you for providing the file. It seems to me that the read with the name euk_Picpa1_chromosome_1-4634 occurs three times in the BAM file, with the flags 121, 99 and 147, instead of twice as is usual for paired reads.

As far as I understand the SAM format, this seems odd, as no read is marked as supplementary.

Do you know if this is also the case in the individual BAM files before merging? You can check this with a simple samtools view *.bam | grep euk_Picpa1_chromosome_1-4634
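More generally, you could list every read name that occurs more than twice in the merged file, for example:

# count occurrences of each read name and keep those seen more than twice
samtools view samples_all.sorted2.bam | cut -f1 | sort | uniq -c | awk '$1 > 2' | head

(the file name is taken from your merge command; adjust as needed).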

I will have to have a closer look into this.

mgabriell1 commented 2 years ago

Hi, I've checked, and yes: one BAM file contains that read with both 99 and 147, while another one contains it with 121.

openpaul commented 2 years ago

In my experience, the read ID should be unique to a single read pair; otherwise, how could we reconstruct the relation between paired reads? After reconstructing the fastq files, in your case the same read ID seems to have been used twice, or sometimes even three times.

Is this maybe simulated data? Make sure that simulated datasets use unique IDs for each read.
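You could check the FASTQ files directly with something like this (the file name is just an example):

# print header lines that appear more than once in a FASTQ file
awk 'NR % 4 == 1' sample_QC_R1.fq | sort | uniq -d | head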

I could be wrong though. Let me know.

mgabriell1 commented 2 years ago

Yes, this is simulated data that I generated using CAMISIM. I've checked the read files and they contain that read just once, but the duplication appears in the SAM file obtained using bwa mem -t 16 "$sample"_QC_R1.fq "$sample"_QC_R2.fq > /"$sample".sam. Maybe some error occurred during mapping, but I've checked the log files and nothing comes up. In any case, thanks for the help!

openpaul commented 2 years ago

Ok,

so to me this sounds like the script should accept multiple BAM files. Then we could avoid merged BAM files, and thus duplicate read names, when using simulated data.

I will post here, once I have adapted the script for this.

mgabriell1 commented 2 years ago

Yes, that would probably be the best solution, as then no one would have to spend time merging the files. Thanks!

javiercnav commented 2 years ago

I want to follow up on this post. Even though the docs (https://eukcc.readthedocs.io/en/latest/bin_merging.html#preparing-your-linked-reads) say the script can take multiple BAMs by using *.bam, it does not work for this. Error: binlinks.py: error: unrecognized arguments:

openpaul commented 1 year ago

That's unfortunate. I will look into it. Thank you for mentioning it.

gjordaopiedade commented 1 year ago

I want to follow up on this post. Even though the docs (https://eukcc.readthedocs.io/en/latest/bin_merging.html#preparing-your-linked-reads) say the script can take multiple BAMs by using *.bam, it does not work for this. Error: binlinks.py: error: unrecognized arguments:

I solved this issue by cloning the repository and running that binlinks.py inside the EukCC conda environment (conda create -y -n eukrep-env -c bioconda "eukcc>=2").

It seems that there is some issue with the singularity module, and the version of this script in the conda repository is not the latest.
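Roughly what I did (paths are illustrative):

# install EukCC into a fresh environment and grab the latest script from the repository
conda create -y -n eukrep-env -c bioconda "eukcc>=2"
conda activate eukrep-env
git clone https://github.com/Finn-Lab/EukCC.git
python EukCC/scripts/binlinks.py --help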