asan-nasa / adapt_find

Tool for identifying adapter sequences from single-end sequencing files
MIT License

[Errno 2] No such file or directory: 'file.fastq' when using the --input_path argument #4

Closed. NicolasProvencher closed this issue 7 months ago.

NicolasProvencher commented 7 months ago

When specifying an input path, the path passed to multiple steps of your script is only the filename, which breaks the code because it can't find the file in the adapt_find dir: [Errno 2] No such file or directory: 'GSM1608268.fastq'

Line 981 -> I added args.input_path + "/" + before the f, and the same on line 154, the line that launches cutadapt, etc.

Another thing I tried was adding os.chdir(cwd).
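A minimal sketch of the workaround described above (the helper name is mine, not from the script): prefix each bare filename with the directory given via --input_path before it reaches cutadapt and the other steps.

```python
import os

# Hypothetical helper illustrating the workaround: join the --input_path
# directory with the bare filename the script currently passes around.
# os.path.join also copes with a trailing slash, unlike a manual "+ '/' +".
def resolve_input(input_path, filename):
    return os.path.join(input_path, filename)
```

For example, resolve_input("/path/to/adapt/test/in", "GSM1608268.fastq") gives a path cutadapt can actually open, regardless of the directory the script was launched from.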

I am able to fix it for myself, but since I am not familiar with the overall organisation of your script, I thought I'd raise this issue so that you can fix it in a way that won't break anything else.

So you have a better understanding of what's happening, here is a tree of my setup:

adapt
├── adapt_find    (all the files of your git dir)
└── test
    ├── in
    │   ├── fastq1
    │   └── fastq2
    └── out

Here is the command line: python adapt_find.py ILLUMINA --input_path path/to/adapt/test/in --output_path /path/to/adapt/test/out, and I run it from /path/to/adapt/adapt_find.

NicolasProvencher commented 7 months ago

Another issue I ran into: when I run the script on multiple files, whether using the --input_path or --files argument, if more than one file is given the code gets stuck forever after the first file's cutadapt step has been processed.

Total number of input files = 2
Number of bigger files (>=10GB) = 0
Number of smaller files (<10 GB) = 2
number of CPU is 12
Processing 2 files
processing file GSM1948925.fastq
processing file GSM1608268.fastq
Median length of aligned sequences for filename - GSM1948925 is 8.0
Writing BLAST output
   query  subject aligned_seq    adapter  adapter_length  aligned_length  evalue  qstart  sstart
0  19733     4731    AAGCTAAG   AAGCTAAG               8               8     1.4      20      24
1  13644     4731    AAGCTAAG   AAGCTAAG               8               8     1.3      20      24
2   8250     4731    AAGCTAAG   AAGCTAAG               8               8     1.5      20      24
3  12742     4731    AAGCTAAG   AAGCTAAG               8               8     1.4      20      24
4   7057     4657   ACCTCGGGC  ACCTCGGGC               9               9     0.4      17      20

Putative three prime end adapters for filename - GSM1948925 is ['GTATTAG', 'GCCAAAGC']
Three prime end adapter sequence for filename - GSM1948925 is GTATTAG
Trimming with CUTADAPT

cutadapt -q 20 -m 15 -M 50 -a GTATTAG -o /home/noxatras/Desktop/adapt/test1/out/good-mapping/GSM1948925_trimmed.fastq /home/noxatras/Desktop/adapt/test1/in/GSM1948925.fastq > /home/noxatras/Desktop/adapt/test1/out/aux_files/GSM1948925/GSM1948925_cutadapt.txt
[---------=8 ] 00:00:28 6,614,307 reads @ 4.3 µs/read; 13.84 M reads/minute
len of new_lst is GSM1948925 1764746
len of filtered list is GSM1948925 1177940
Empty DataFrame
Columns: [query, subject, aligned_seq, adapter, aligned_length, evalue, qstart, sstart]
Index: []

(I pressed Ctrl+C here after waiting a good 30 minutes.)

^C
Traceback (most recent call last):
  File "adapt_find.py", line 996, in <module>
    result_list.append(pool.map(worker, [f for f in nasa]))
  File "/home/noxatras/miniconda3/envs/adapt/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/noxatras/miniconda3/envs/adapt/lib/python3.7/multiprocessing/pool.py", line 651, in get
    self.wait(timeout)
  File "/home/noxatras/miniconda3/envs/adapt/lib/python3.7/multiprocessing/pool.py", line 648, in wait
    self._event.wait(timeout)
  File "/home/noxatras/miniconda3/envs/adapt/lib/python3.7/threading.py", line 552, in wait
    signaled = self._cond.wait(timeout)
  File "/home/noxatras/miniconda3/envs/adapt/lib/python3.7/threading.py", line 296, in wait
    waiter.acquire()
KeyboardInterrupt

(The above is interleaved with KeyboardInterrupt tracebacks from ForkPoolWorker-1 and ForkPoolWorker-3 through ForkPoolWorker-7, each blocked either in multiprocessing/queues.py get() acquiring self._rlock via self._semlock.__enter__(), or in multiprocessing/connection.py _recv() reading from the result pipe.)

asan-nasa commented 7 months ago


Hello. Thanks for pointing that out. I made some changes in December last year to fix the output path; unbeknownst to me, that created issues with the input path, which I have now fixed. Instead of adding args.input_path at several places, I just modified line no. 77, so it works everywhere further down the script. Please let me know if the latest version works for you with the --input_path option.
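The fix as described could look something like this sketch (function and variable names are my guesses, not the actual adapt_find code): resolve every input file to an absolute path once, early in the script, so all later steps work regardless of the current working directory.

```python
import glob
import os

# Hedged sketch: collect FASTQ files under --input_path as absolute
# paths up front, so cutadapt/BLAST calls further down the script no
# longer depend on the directory the script was launched from.
def collect_fastqs(input_path):
    pattern = os.path.join(os.path.abspath(input_path), "*.fastq")
    return sorted(glob.glob(pattern))
```

Doing this once near the top means no later step needs its own args.input_path prefix, which matches the spirit of a single-line fix.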

asan-nasa commented 7 months ago


The problem is not with the --input_path or --files argument. I tested them with my small RNA FASTQ files using the latest version of adapt_find from GitHub, and it works fine with multiple files. But based on the output log in your post, I can presume what the problem might be. The FASTQ files you are using, going by their GSM numbers, appear to correspond to these two entries (please check the GSM number after you open the link): https://www.ncbi.nlm.nih.gov/sra/?term=SRR1802156 and https://www.ncbi.nlm.nih.gov/sra/?term=SRR1802129. If so, those two entries correspond to ribosomal RNA sequencing.

adapt_find is trained only on sRNA datasets to identify adapter sequences, although it can also identify adapter sequences from single-end ribosomal RNA sequencing files. However, adapt_find is only customized to detect whether input FASTQ files are already adapter trimmed for sRNA libraries, not for single-end ribosomal RNA sequencing. I assume your two FASTQ files are already adapter-trimmed. When you provide an adapter-trimmed small RNA FASTQ file as input, adapt_find will detect it as adapter trimmed based on the length distribution; I have not included any such checkpoint for ribosomal RNA sequencing files. It looks to me like you have used an adapter-trimmed ribosomal RNA sequencing file as input. Therefore, for your input FASTQ files, the script assumes the files still have adapters and looks for them by creating query and subject files using pre-defined criteria. In that case, common biological sequences were identified as adapters for the first FASTQ file. For the second file, I assume there was either no query and/or no subject FASTA (no sequences left after filtering), hence the resulting data frame is empty and, as a result, it hangs.

There are checkpoints to verify whether a data frame is empty while processing the BLAST output as a data frame, but there is no checkpoint to verify whether the BLAST file itself is empty after reading it. I didn't have that checkpoint because, hypothetically, output from BLAST cannot be empty.

My answer is based on the assumption that your input FASTQ files are adapter-trimmed rRNA sequencing files. If they are small RNA sequencing files instead, can you please send the SRA link so that I can download them and examine the issue? If it is ribosomal RNA sequencing, as I assume, then I will make changes to detect whether an input ribosomal RNA sequencing file is adapter trimmed or not.
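The missing checkpoint could be sketched like this (hypothetical code, not the actual adapt_find implementation): treat an empty BLAST output file the same way as an empty data frame, and skip adapter calling for that file instead of continuing.

```python
import os

import pandas as pd

# Column layout taken from the log output in this thread.
BLAST_COLS = ["query", "subject", "aligned_seq", "adapter",
              "aligned_length", "evalue", "qstart", "sstart"]

# Hypothetical guard: return None when the BLAST tabular output holds
# no hits at all, so downstream code can skip the file rather than
# process an empty DataFrame (the state in which the run hangs).
def read_blast_or_none(path):
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return None
    df = pd.read_csv(path, sep="\t", names=BLAST_COLS)
    return None if df.empty else df
```

The caller would then check for None and mark the file as "no adapter found" instead of waiting on further processing.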

NicolasProvencher commented 7 months ago

For the input test, I will try it tomorrow since I don't have my work computer with me right now.

As for the 2nd issue, I can confirm that I am trying to use adapt_find on ribosome profiling studies. Since I am trying to re-analyse about 1.5k ribosome profiling files, I found your program pretty easy to include in my pipeline.

In all my datasets, from a preliminary analysis, I seem to have both adapter-trimmed reads and non-trimmed or partially trimmed reads. Since ribosome profiling reads are usually between 20 and 40 nt, and a lot of files have a fixed 51 to 151 nt length for all their reads, I assumed adapters were left in those.
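The length heuristic I'm describing could be sketched roughly like this (thresholds are my own guesses, not anything from adapt_find):

```python
# Rough sketch: ribosome-profiling footprints are ~20-40 nt, so if most
# reads fall in that window the file is probably already adapter
# trimmed; a file where reads sit at a fixed longer length (e.g. 51 or
# 151 nt) likely still carries adapter sequence.
def looks_adapter_trimmed(read_lengths, lo=20, hi=40, min_frac=0.8):
    if not read_lengths:
        return False
    in_window = sum(lo <= n <= hi for n in read_lengths)
    return in_window / len(read_lengths) >= min_frac
```

Something like this, run on a sample of each FASTQ, is how I was planning to pre-sort files before deciding whether trimming is needed.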

Initially I was thinking of running your program on all my files, whether or not they had adapters left, and then parsing the cutadapt stats to decide whether the adapter found and the trimming done are acceptable, i.e. whether to use the pre-trim or post-trim FASTQ for alignment with STAR.

I hope this gives you enough context. If you can make it so it doesn't break between files, I would be most grateful.

Thank you for your time and help

NicolasProvencher commented 7 months ago

UPDATE: I confirmed that your fix for --input_path worked. Also, I was running the code on a local laptop, and it seems the stall is due to my system giving out rather than the code, because now I am able to run 3 files in a row and it breaks on the 2nd batch (since I have fewer processes running at the same time).

Here's a console log:

Total number of input files = 5
Number of bigger files (>=10GB) = 2
Number of smaller files (<10 GB) = 3
number of CPU is 12
Processing 3 files
processing file GSM2779672.fastq
processing file GSM1948925.fastq
processing file GSM1608268.fastq

This part is able to finish correctly.

Working on bigger file(s) (file size greater than 10 GB)
processing file GSM3168233.fastq
processing file GSM2724036.fastq

It stalls here, with the same error message when I keyboard-interrupt.

To solve this problem, I will set it up so it calls the script on each file one by one.
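A driver along these lines would do it (the command layout is pieced together from this thread, so treat it as a guess; dry_run is on by default):

```python
import glob
import subprocess
import sys

# Hypothetical per-file driver: launch a separate adapt_find run for
# every FASTQ so one hanging file cannot stall the whole batch.
def run_one_by_one(in_dir, out_dir, dry_run=True):
    cmds = []
    for fastq in sorted(glob.glob(f"{in_dir}/*.fastq")):
        cmd = [sys.executable, "adapt_find.py", "ILLUMINA",
               "--files", fastq, "--output_path", out_dir]
        cmds.append(cmd)
        if not dry_run:
            # check=False: record a failure but keep going with the rest
            subprocess.run(cmd, check=False)
    return cmds
```

Running each file in its own process also means a per-file timeout (e.g. via subprocess.run's timeout argument) becomes possible.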

Since your code seems to use multiprocessing and I'm running it on a CPU cluster, any idea how many CPU cores I should assign per job?

For the issue's sake, I will consider it resolved for now.