biosails / pheniqs

Fast and accurate sequence demultiplexing
Other
26 stars 4 forks source link

Last record missing in barcode corrected BAM file #33

Open hukai916 opened 2 years ago

hukai916 commented 2 years ago

Hi developers,

I encountered a mysterious issue when using Pheniqs for barcode correction.

Basically, when performing the correction, I first prepare a CRAM file containing barcodes to be corrected and a json file according to the tutorial, then run Pheniqs. However, the resulting BAM files is one record less than the CRAM, and the missing one seems to be always the last record in the CRAM.

This happens only occasionally. For example, I run the same code for 6 times and capture one (bam3):

$ samtools view -c pbmc_500_10p_2_aa.corrected.bam
450000
$ samtools view -c pbmc_500_10p_2_aa.corrected.bam1
450000
$ samtools view -c pbmc_500_10p_2_aa.corrected.bam2
450000
$ samtools view -c pbmc_500_10p_2_aa.corrected.bam3
449999
$ samtools view -c pbmc_500_10p_2_aa.corrected.bam4
450000
$ samtools view -c pbmc_500_10p_2_aa.corrected.bam5
450000

Not a big matter, but still annoying, I can provide more info if you would like to look into it.

Thanks,

--Kai

moonwatcher commented 2 years ago

Ill need more details. but is it possible one read fails decoding and the filter outgoing qc fail or filter incoming qc fail or is on?

hukai916 commented 2 years ago

What info do you need? I am attaching my test data and code here: https://www.dropbox.com/sh/6idqunobck72q5j/AAClfpkwhusMVDPNx6HU7wSYa?dl=0

Note that bam2 is missing the last record where bam1 and bam3 are correct. Let me know if you need other info. Thanks!

moonwatcher commented 2 years ago

@hukai916 so all bam files in the dropbox folder are outputs from the cram file? Can you please post the json config file you use and which version of pheniqs you are using?

log_decode2.txt seems to be the one showing a missing record. 450000 are reported in incoming and only 449999 are reported in sample. being the last record does point to some thread synchronization issue but obviously an extremely rare one since I have never encountered it. How many cores is this running on?

hukai916 commented 2 years ago

Hi, all bams are created from the same cram using the same pheniqs command. The missing-last-record cases occur occasionally, roughly 10%-20%, no matter how many cores I use. I will give more info after the holidays.

moonwatcher commented 2 years ago

Can you please post pbmc_500_5p_3_aa.json and I will try and replicate it.

L.

hukai916 commented 2 years ago

Hi L., I have uploaded the json file to the same folder. The Pheniqs I used is: pheniqs version 2.1.0 @moonwatcher

--Kai

moonwatcher commented 2 years ago

Hello Kai

Just a quick update., I have been trying to reproduce this but have so far been unsuccessful. I tried both with a build of pheniqs 2.1.0 and with head.

A side comment: using an extension of ".bam1" pheniqs does not detect that you actually wanted bam encoding as so it will revert to the default SAM output. This is why the files you provided are bigger than I expected. They are actually uncompressed, simple text in SAM format. You can override the default output format with --format bam.

Are you observing this on MacOS or Linux? What exact platform? might also be useful to know what dependencies. Did you build the binary yourself or installed it from Conda?

It is possible this is related to one of the dependencies. Since this is so tricky to replicate, will you be willing to test on custom build? I can provide instructions if you are interested.

hukai916 commented 2 years ago

Thanks L.,

I installed Pheniqs with conda on a Ubuntu Xenial container (Docker). The container contains solely miniconda, Pheniqs, pysam, and SAMTools. The container can be pulled by:

docker pull hukai916/pheniqs_xenial:0.2

I can perform more tests, pls share instructions. Thanks!

--Kai