MikkelSchubert / paleomix

Pipelines and tools for the processing of ancient and modern HTS data.
https://paleomix.readthedocs.io/en/stable/
MIT License

Error with trimming SE adapters from sample #52

Closed Uriwolkow closed 8 months ago

Uriwolkow commented 8 months ago

Hello! I ran into a node running error with a sample I'm working with. I should add that this sample is different from my samples of interest, which are PE. Mine seem to run fine through the pipeline, but the results look suspiciously empty, as if there is no match at all between sample and reference. That is why I chose to run another sample, which had already been mapped by other methods without the pipeline, to check whether the issue is with my data or with my makefile.

On my first attempt, the STDERR file includes this text:

Trimming single ended reads ...
Opening FASTQ file 'Vole_test_raw/MM1000.fastq.gz', line numbers start at 1

Processed 1,000,285 reads in 14.0s; 71,000 reads per second ...
Processed 2,001,875 reads in 27.4s; 72,000 reads per second ...
Processed 3,003,467 reads in 40.4s; 74,000 reads per second ...
Processed 4,005,354 reads in 53.9s; 74,000 reads per second ...
Processed 5,007,149 reads in 1:07.8s; 73,000 reads per second ...
Processed 6,008,780 reads in 1:20.8s; 74,000 reads per second ...
Processed 7,009,170 reads in 1:34.3s; 74,000 reads per second ...
Processed 8,010,903 reads in 1:48.1s; 74,000 reads per second ...
Processed 9,012,598 reads in 2:01.9s; 73,000 reads per second ...
Processed 10,014,250 reads in 2:15.1s; 74,000 reads per second ...
Processed 11,015,965 reads in 2:28.8s; 74,000 reads per second ...
Processed 12,017,599 reads in 2:42.0s; 74,000 reads per second ...
Processed 13,017,802 reads in 2:55.2s; 74,000 reads per second ...
Processed 14,019,601 reads in 3:09.4s; 73,000 reads per second ...
Processed 15,021,310 reads in 3:22.6s; 74,000 reads per second ...
Processed 16,022,832 reads in 3:35.7s; 74,000 reads per second ...
Processed 17,024,547 reads in 3:49.1s; 74,000 reads per second ...
Processed 18,026,280 reads in 4:03.0s; 74,000 reads per second ...
Processed 19,027,986 reads in 4:16.3s; 74,000 reads per second ...
Processed 20,028,236 reads in 4:29.8s; 74,000 reads per second ...
Processed 21,029,958 reads in 4:43.6s; 74,000 reads per second ...
Processed 22,031,616 reads in 4:56.7s; 74,000 reads per second ...
Processed 23,033,374 reads in 5:10.4s; 74,000 reads per second ...
Processed 24,035,170 reads in 5:23.1s; 74,000 reads per second ...
Processed 25,035,579 reads in 5:37.2s; 74,000 reads per second ...
Processed 26,037,423 reads in 5:50.7s; 74,000 reads per second ...
Processed 27,039,254 reads in 6:04.8s; 73,000 reads per second ...
Processed 28,040,910 reads in 6:17.7s; 74,000 reads per second ...
Processed 29,041,372 reads in 6:31.9s; 73,000 reads per second ...
Processed 30,043,188 reads in 6:46.1s; 73,000 reads per second ...
Processed 31,044,794 reads in 6:59.9s; 73,000 reads per second ...
Processed 32,046,528 reads in 7:14.2s; 72,000 reads per second ...
Processed 33,048,283 reads in 7:27.5s; 73,000 reads per second ...
Processed 34,050,017 reads in 7:41.4s; 72,000 reads per second ...
Processed 35,050,334 reads in 7:54.7s; 72,000 reads per second ...
ERROR: Unhandled exception in thread:
    line_reader::refill_buffers_gzip: unknown error ('incorrect data check'):
    iostream error
ERROR: AdapterRemoval did not run to completion;
       do NOT make use of resulting trimmed reads! 

Later I got a different error, an OSError, after I edited the targets in my makefile to be specifically SE using the "Single:" key (as indicated in the documentation). This is the STDERR it produced, this time from the file validation stage:

Traceback (most recent call last):
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/site-packages/paleomix/main.py", line 122, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/site-packages/paleomix/main.py", line 114, in main
    return module.main(argv[1:])
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/site-packages/paleomix/tools/validate_fastq.py", line 48, in main
    for record in FASTQ.from_file(filename):
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/site-packages/paleomix/common/formats/fastq.py", line 107, in from_file
    yield from FASTQ.from_lines(handle)
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/site-packages/paleomix/common/formats/fastq.py", line 82, in from_lines
    separator = next(lines_iter).rstrip()
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/gzip.py", line 289, in read1
    return self._buffer.read1(size)
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/gzip.py", line 454, in read
    self._read_eof()
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/gzip.py", line 501, in _read_eof
    hex(self._crc)))
OSError: CRC check failed 0xffffffff != 0xfb10cbb7

Any suggestions or ideas on what could be done? Best wishes, Uri

MikkelSchubert commented 8 months ago

Hi Uri,

It looks like the Vole_test_raw/MM1000.fastq.gz file is corrupt. Try running gzip -vt Vole_test_raw/MM1000.fastq.gz to test it; that will likely report a similar error.
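For anyone who prefers to run the same check from Python (for example, over many files at once), here is a minimal sketch that approximates what `gzip -t` does. The function name `gzip_is_intact` is made up for illustration and is not part of PALEOMIX. It simply decompresses the whole file and reports whether that succeeded.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the file decompresses to the end without an error.

    Hypothetical helper, not part of PALEOMIX. Reading to the end of the
    stream forces Python's gzip module to verify the stored CRC, which is
    the check that failed in the traceback above.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Discard the data; we only care whether reading succeeds.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError: bad header or CRC mismatch; EOFError: truncated file;
        # zlib.error: damaged compressed data.
        return False
    return True
```

Calling `gzip_is_intact("Vole_test_raw/MM1000.fastq.gz")` would be expected to return `False` for a file that fails `gzip -vt`.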

You will need to either find a non-corrupt copy of the file somewhere or truncate it so that it only contains valid data (i.e. unpack as much as you can, drop the last read if it is not complete, and then re-compress). If you've mapped this previously (manually or using a different pipeline), then I'd also recommend going back and verifying that it was not mapped using the corrupted file.
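The truncation approach described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a PALEOMIX tool: `salvage_fastq` is a hypothetical name, and the code assumes plain 4-line FASTQ records (no wrapped sequences). It copies complete records until decompression fails or a record looks malformed, then stops.

```python
import gzip
import zlib


def salvage_fastq(src, dst):
    """Copy complete 4-line FASTQ records from src to dst; return the count.

    Hypothetical helper. Copying stops at the first decompression error
    or at the first record whose header/separator lines look wrong.
    """
    written = 0
    record = []
    with gzip.open(dst, "wb") as out:
        try:
            with gzip.open(src, "rb") as handle:
                for line in handle:
                    record.append(line)
                    if len(record) < 4:
                        continue
                    # Minimal sanity check on the record layout.
                    if not (record[0].startswith(b"@") and record[2].startswith(b"+")):
                        break
                    out.writelines(record)
                    written += 1
                    record = []
        except (OSError, EOFError, zlib.error):
            # Decompression failed; any partial record is dropped.
            pass
    return written
```

Note that when the failure is a checksum mismatch rather than a truncated file, the salvaged reads decompressed without error but are not guaranteed to be unaltered, so a clean copy of the original file remains the safer option.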

Note also that the Single etc. keys are meant only for reads that have already been trimmed and optionally merged. They should not be used if that is not the case.

Best, Mikkel

Uriwolkow commented 8 months ago

I thought the Single key wasn't really helping here, and I understand now that it is only for pre-trimmed reads (not my case). I'll check the file soon and see if I can use a non-corrupt copy. Maybe the issues with my other samples also had to do with the files, and not the pipeline. Thanks for the quick reply!