Duplicated Key error - GitHub Issues

JKuroYama commented 2 weeks ago

Hi mate!,

Love your program it has become my default to go for clustering and consensus.

I have been using it with the error below popping up sporadically, but recently it happens every time. I wonder if you could help me figure out how to get around this problem. I am using WSL2; could this be the problem?

I am very grateful for any advice you can provide.

All the best!

-------------------------------- // Error copy //-------------------------------
Writing sequences with high similarity in separate files
Traceback (most recent call last):
  File "/home/nano/nano_bin/amplicon_sorter_2024-02-20.py", line 2081, in <module>
    sort_groups()
  File "/home/nano/nano_bin/amplicon_sorter_2024-02-20.py", line 2016, in sort_groups
    sort(name)
  File "/home/nano/nano_bin/amplicon_sorter_2024-02-20.py", line 2044, in sort
    filter_seq(group_filename, grouplist, indexes)
  File "/home/nano/nano_bin/amplicon_sorter_2024-02-20.py", line 1451, in filter_seq
    record_dict = SeqIO.index(os.path.join(infolder, infile), 'fastq')
  File "/home/nano/miniforge3/lib/python3.10/site-packages/Bio/SeqIO/__init__.py", line 886, in index
    return _IndexedSeqFileDict(random_access_proxy, key_function, repr, "SeqRecord")
  File "/home/nano/miniforge3/lib/python3.10/site-packages/Bio/File.py", line 203, in __init__
    raise ValueError(f"Duplicate key '{key}'")
ValueError: Duplicate key '93aacf80-eb76-4d1c-bf74-ca12d6d07c12'
------------------------------------------// END //----------------------------

avierstr commented 2 weeks ago

Hi JKuroYama, is that reproducible? Can you send me a data file that causes this problem? The error message suggests there are two reads in your data file with the same name. Greets, Andy
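One way to check for reads with the same name before running the program is to count the read IDs in the FASTQ file. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and are not part of amplicon_sorter. It assumes each record takes exactly four lines (no wrapped sequence lines).

```python
import gzip
from collections import Counter
from itertools import islice


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "r")


def duplicate_read_ids(path):
    """Return {read_id: count} for every read ID that occurs more than once.

    Assumption: 4-line FASTQ records, so every 4th line is a header.
    """
    counts = Counter()
    with open_fastq(path) as handle:
        for header in islice(handle, 0, None, 4):
            # The ID is the first whitespace-delimited token after '@'.
            fields = header.strip()[1:].split()
            if fields:
                counts[fields[0]] += 1
    return {read_id: n for read_id, n in counts.items() if n > 1}
```

Calling `duplicate_read_ids("EXP-PBC096_barcode01.filtered.fastq")` would return an empty dictionary for a file in which all read IDs are unique.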

JKuroYama commented 2 weeks ago

Hi Andy,

My name is Javier. Thank you very much for your reply. I tried today on a fresh WSL2 Ubuntu install, but I consistently get the two errors described below when using the attached dataset.

EXP-PBC096_barcode01.filtered.fastq.gz

I am very grateful for all your time and help,

Best,

Case #1

$ amplicon_sorter.py -i EXP-PBC096_barcode01.filtered.fastq -ss 97 -sc 98 -min 5000 -max 6000 -ar -np 1 -maxr 100000 -sfq

--------------// ERROR 1 //----------------------
Writing sequences with high similarity in separate files
Traceback (most recent call last):
  File "/home/nano/nanobin/amplicon_sorter.py", line 2081, in <module>
    sort_groups()
  File "/home/nano/nanobin/amplicon_sorter.py", line 2016, in sort_groups
    sort(name)
  File "/home/nano/nanobin/amplicon_sorter.py", line 2044, in sort
    filter_seq(group_filename, grouplist, indexes)
  File "/home/nano/nanobin/amplicon_sorter.py", line 1451, in filter_seq
    record_dict = SeqIO.index(os.path.join(infolder, infile), 'fastq')
  File "/home/nano/mambaforge/lib/python3.10/site-packages/Bio/SeqIO/__init__.py", line 886, in index
    return _IndexedSeqFileDict(random_access_proxy, key_function, repr, "SeqRecord")
  File "/home/nano/mambaforge/lib/python3.10/site-packages/Bio/File.py", line 203, in __init__
    raise ValueError(f"Duplicate key '{key}'")
ValueError: Duplicate key '7c4ce1c8-faad-4c79-8706-0b053b03becb'
--------------// END ERROR 1 //----------------------

Case #2

$ amplicon_sorter.py -i EXP-PBC096_barcode01.filtered.fastq -ss 97 -sc 98 -min 5000 -max 6000 -ar -np 10 -maxr 100000 -sfq

--------------// ERROR 2 //----------------------
Exception in thread Thread-2 (feeder):
Traceback (most recent call last):
  File "/home/nano/mambaforge/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/nano/mambaforge/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/nano/nanobin/amplicon_sorter.py", line 723, in feeder
    filenames.sort(key=lambda x: os.path.getmtime(os.path.join(
  File "/home/nano/nanobin/amplicon_sorter.py", line 723, in <lambda>
    filenames.sort(key=lambda x: os.path.getmtime(os.path.join(
  File "/home/nano/mambaforge/lib/python3.10/genericpath.py", line 55, in getmtime
    return os.stat(filename).st_mtime
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/d/KANSO_nanoDG/workSeq/for-Javi-pipe/n18S28Slib1/n18S28Slib1_nanoFilter/5k-6k/EXP-PBC096_barcode01.filtered/file_2.todo'
--------------// END ERROR 2 //----------------------

JKuroYama commented 2 weeks ago

You are right, there are some duplicated reads for some reason, which explains the "Duplicate key" error, but I am lost on "Exception in thread Thread-2 (feeder):".

Thanks ;)

avierstr commented 2 weeks ago

Hi Javier, I'm getting the same "Duplicate key" error. Why do you have duplicate reads? Did a program misbehave before you put your samples into amplicon_sorter? This should not happen, but if necessary I can try to "catch" that error so that the program continues.

Your case 2 (FileNotFoundError) is probably because you are working on an external drive ('/mnt/d/KANSO_nanoDG/...'). I have noticed there is a small delay when writing files to external disks and cloud drives such as OneDrive or Google Drive. The program tries to read a file that has not been written completely, and that causes the error. Can you try your analysis on your internal hard disk?
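For illustration, the traceback in case 2 comes from sorting a list of file names by modification time, where one of the listed files could not be found when it was inspected. The sketch below shows one defensive way to do such a sort so that a missing file is skipped instead of raising FileNotFoundError. It is an assumption-laden example, not amplicon_sorter's code: the function name `sort_by_mtime` is hypothetical, and whether skipping a file is safe for the program's own logic is not known from this thread.

```python
import os


def sort_by_mtime(folder, filenames):
    """Return the names of the files that can be found, oldest first.

    Files that cannot be found when their modification time is read
    are skipped instead of raising FileNotFoundError.
    """
    stamped = []
    for name in filenames:
        try:
            mtime = os.path.getmtime(os.path.join(folder, name))
        except FileNotFoundError:
            continue  # not visible right now; leave it out of this round
        stamped.append((mtime, name))
    stamped.sort()  # sorts by modification time, then by name
    return [name for _, name in stamped]
```

A caller that polls a work folder could simply call this again on the next round, at which point a file that was slow to appear would be picked up.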

Best regards, Andy

JKuroYama commented 2 weeks ago

Hi Andy,

The duplication is because these reads come from a duplex library prep. However, these files were customised to also include simplex reads (after basecalling and quality control), aiming to minimise data waste.

Therefore, to keep track of which reads were duplex and which were simplex, the IDs were used: the same ID for both reads of a duplex pair, unique IDs for simplex reads.

Perhaps we could add something like: if key "x" already exists, append "-d" to it and write key "x" to "log-duplicated-keys.txt". In my case there should be <10% duplicated read IDs per .fq file.
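The idea above could look roughly like the following sketch, which works on a plain list of read IDs. It is only an illustration of the proposal, not code from amplicon_sorter; the function names are hypothetical, and the numbered suffixes for a third or later repeat ("-d2", "-d3", ...) are an added assumption so that the result stays unique.

```python
from collections import Counter


def make_ids_unique(read_ids, suffix="-d"):
    """Return (new_ids, duplicated) for a list of read IDs.

    The first occurrence of an ID is kept unchanged, the second gets
    "-d", and further repeats get "-d2", "-d3", and so on. The result
    is unique provided no input ID already ends in such a suffix.
    """
    seen = Counter()
    new_ids, duplicated = [], []
    for read_id in read_ids:
        n = seen[read_id]  # how often this ID has been seen so far
        seen[read_id] += 1
        if n == 0:
            new_ids.append(read_id)
            continue
        if n == 1:
            duplicated.append(read_id)  # record each duplicated ID once
        new_ids.append(read_id + suffix + (str(n) if n > 1 else ""))
    return new_ids, duplicated


def write_duplicate_log(duplicated, path="log-duplicated-keys.txt"):
    """Write one duplicated ID per line to the log file."""
    with open(path, "w") as log:
        for read_id in duplicated:
            log.write(read_id + "\n")
```

For a duplex pair sharing one ID, the first read keeps its ID and the second gets the "-d" suffix, so the pairing remains traceable from the log.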

The puzzling thing is that amplicon_sorter was working fine a couple of months ago with the same "EXP-PBC096_barcode01.filtered.fastq.gz" file; the duplicated key error has only become consistent now that I am trying to reconstruct the clusters with the option "-sfq" added and "-ra" removed.

Best Regards,

Javier Montenegro

avierstr commented 1 week ago

Hi Javier,

Looking at the code, I noticed that with the option "-sfq" it indexes the input file, because it has to find the original reads with their Q values. When saving to fasta (the default), all the information (read name and sequence) is already in memory. The indexing is what causes the error. Because the file is indexed with Biopython's SeqIO.index, I cannot catch that error.

Is it necessary that your fastq output files still have identical IDs for duplex pairs? Does Nanopore produce those identical IDs, or do you make them identical in some way? I can rename the reads before I index them and, if necessary, rename them back afterwards. But if there are identical IDs, does that not cause errors in other programs afterwards?

Best regards, Andy

JKuroYama commented 1 week ago

Hi Andy,

No, Nanopore does not produce these duplicated IDs, only a list of ID pairs, and the duplicated IDs are unnecessary; they were just a way to keep track of things, and in practice they make no difference for downstream analyses.

I think my situation is very unique, so perhaps it will be best if I write a short script to append a sequential number to the IDs of all sequences, in this way, they will become unique in a traceable way. This should allow me to use amplicon_sorter without issues. Thanks!
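A short script of the kind described above might look like this sketch. It is a hypothetical example, not the script that was actually written: the function name and the "_<n>" suffix format are assumptions. It assumes 4-line FASTQ records and a header in which the ID is separated from any description by a space.

```python
import gzip


def _open(path, mode):
    """Open a plain or gzip-compressed file in text mode ('r' or 'w')."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)


def number_read_ids(in_path, out_path):
    """Copy a FASTQ file, appending "_<n>" to each read ID.

    Only the ID (first token of the header line) changes; any
    description after it is preserved. Returns the number of reads.
    """
    count = 0
    with _open(in_path, "r") as src, _open(out_path, "w") as dst:
        for line_no, line in enumerate(src):
            # Assumption: every 4th line, starting with the first, is a header.
            if line_no % 4 == 0 and line.strip():
                count += 1
                read_id, _, rest = line.rstrip("\r\n").partition(" ")
                line = f"{read_id}_{count}" + (f" {rest}" if rest else "") + "\n"
            dst.write(line)
    return count
```

Because the original ID is kept as a prefix, the two reads of a duplex pair can still be matched after renaming by stripping the numeric suffix.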

I am grateful for all your help and time!

avierstr / amplicon_sorter

Duplicated Key error #21