Closed laneatmore closed 4 months ago
Hi,
To give a bit of background, this check is in place to catch cases like accidentally including multiple copies of the same FASTQ files in your makefile, or people actually uploading multiple copies of the same data to repositories. So not PCR duplicates, but 1:1 copies of FASTQ files. Picard will sometimes, but not always, detect this problem, depending on where the duplicate data is located and the read type, which is why I added a separate check.
Both of these are situations I've run into, and each is typically a sign of somebody having misorganized data. In the first case you'll want to review which files you are using and, at a minimum, make sure that you are not including the same data multiple times. In the second case, you typically need to contact the original creators of the data and ask them to re-upload it.
However, the check does have false positives, and you may have run into one such instance. The `paleomix dupcheck` command is supposed to help you figure that out by giving you a more detailed report, but I appear to have accidentally made it inaccessible. Thanks for letting me know; I'll fix that! In the meantime you can run the dupcheck command directly via `python3 -m paleomix.tools.dupcheck`. If you have installed paleomix in a virtual or conda environment, then you need to activate it first.
That will print every identical set of reads and give you a total count as well. If you only see a small number of identical reads, then they are probably just false positives, and you can simply run the `touch` commands shown in the error messages to skip the checks. But if a significant portion of the data is duplicated, then it could indicate a problem with the dataset, as described above.
Best, Mikkel
Hi Mikkel,
Thanks so much for the detailed reply. With `dupcheck` I can see that there appear to be 50-100 duplicates per individual. Not a huge proportion of the data, but it is consistent across all samples. I've re-downloaded all the data and checked my makefiles, and it seems like it's a problem with the original upload. I'll try to get that sorted out.
For the time being, the `touch` command to get around the error does seem to be working, thanks!
Best, Lane
Hi Lane,
If it is just a few hundred reads per sample then I wouldn't worry about it.
I don't know how or why it happens, but I have previously observed completely identical mates occurring at a (very) low rate in some datasets, as seems to be the case here judging by the first error you shared. My guess is that this is a problem that occurs during sequencing or base-calling.
If so, then I'm not sure you could even do anything about it, except perhaps filter out the duplicate sequences. But given the low numbers, it is probably not worthwhile to do so.
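For completeness, filtering would just mean keeping the first occurrence of each identical pair. A rough sketch of the idea, assuming each pair is a tuple of two `(name, sequence, quality)` mates (dedicated deduplication tools do this more efficiently, in a streaming fashion):

```python
def filter_identical_pairs(pairs):
    """Yield read pairs, dropping exact 1:1 duplicates.

    `pairs` is an iterable of (mate1, mate2) tuples, where each mate
    is a (name, sequence, quality) tuple. A pair is dropped only if
    an identical pair, both mates included, was seen before.
    """
    seen = set()
    for pair in pairs:
        if pair in seen:
            continue  # exact copy of an earlier pair
        seen.add(pair)
        yield pair
```

Note that this holds every distinct pair in memory, which is fine for spot checks but not for filtering whole lanes of data.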
Best, Mikkel
Hi, I have been using paleomix for a long time (thanks for this program!) and am now trying to use it on some data I downloaded from SRA. Every time I run the pipeline I get an error that there are duplicated reads in my BAM files, and I am prompted to run `paleomix dupcheck`:
When I try to run dupcheck, paleomix says it can't find the command. I've tried checking for duplicated reads in the FASTQ files with FastQC and seqkit, and deduplicating with czid-dedup, but it hasn't made a difference. I've also tried setting `PCRDuplicates:` to 'mark' instead of 'filter', but I still get the same results.
Following this page, I would expect there to be a problem with library merging, but I only have one paired set of fastq files so I think this is quite unlikely.
I've independently run Picard MarkDuplicates on the `M-Toku_1_1.rmdup.normal.bam` output from paleomix, and the output was as follows:
I'm trying to sort out whether there is a paleomix issue, something particular about this data (e.g., possible polyploidy), or whether the files are somehow corrupted. Is dupcheck still a viable command? Or are there other reasons I might be seeing this error? How does paleomix deal with polyploidy?
Thanks!
My makefile is copied here: