CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

extract error #456

Closed: gitamahm closed this issue 3 years ago

gitamahm commented 3 years ago

Hi, I tried to run the extract command, but I get the following error (see the log below, edited to use relative paths for easier viewing). I have no problems running the extract example in the UMI-tools tutorial. Any help would be much appreciated.

UMI-tools version: 1.0.1
# output generated by extract --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNNNN 
--stdin ./raw_reads/TSP1_exopancreas1_3_S21_L003_R1_001.fastq.gz 
--stdout ./umiFiltered/TSP1_exopancreas1_3_S21_L003_extracted_R1.fastq.gz 
--read2-in ./raw_reads/TSP1_exopancreas1_3_S21_L003_R2_001.fastq.gz 
--read2-out=./umiFiltered/TSP1_exopancreas1_3_S21_L003_extracted_R2.fastq.gz 
--whitelist=./umiFiltered/TSP1_exopancreas1_3_S21_L003_whitelist.txt 
--log=./logs/TSP1_exopancreas1_3_S21_L003_extracted_R2.log

# job started at Mon Feb  8 13:08:51 2021 on sh03-04n33.int -- 188a47c0-0722-4c9b-8caa-fce971d35915
# pid: 15062, system: Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64
# blacklist                               : None
# compresslevel                           : 6
# either_read                             : False
# either_read_resolve                     : discard
# error_correct_cell                      : False
# extract_method                          : string
# filter_cell_barcode                     : None
# filter_cell_barcodes                    : False
# log2stderr                              : False
# loglevel                                : 1
# pattern                                 : CCCCCCCCCCCCCCCCNNNNNNNNNNNN
# pattern2                                : None
# prime3                                  : None
# quality_encoding                        : None
# quality_filter_mask                     : None
# quality_filter_threshold                : None
# random_seed                             : None
# read2_in                                : ./raw_reads/TSP1_exopancreas1_3_S21_L003_R2_001.fastq.gz
# read2_out                               : ./umiFiltered/TSP1_exopancreas1_3_S21_L003_extracted_R2.fastq.gz
# read2_stdout                            : False
# reads_subset                            : None
# reconcile                               : False
# retain_umi                              : None
# short_help                              : None
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
# stdin                                   : <_io.TextIOWrapper name='./raw_reads/TSP1_exopancreas1_3_S21_L003_R1_001.fastq.gz' encoding='ascii'>
# stdlog                                  : <_io.TextIOWrapper name='./logs/TSP1_exopancreas1_3_S21_L003_extracted_R2.log' mode='a' encoding='UTF-8'>
# stdout                                  : <_io.TextIOWrapper name='./umiFiltered/TSP1_exopancreas1_3_S21_L003_extracted_R1.fastq.gz' encoding='ascii'>
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
# whitelist                               : ./umiFiltered/TSP1_exopancreas1_3_S21_L003_whitelist.txt

2021-02-08 13:08:51,839 INFO Starting barcode extraction
2021-02-08 13:08:55,847 INFO Parsed 100000 reads
2021-02-08 13:08:59,652 INFO Parsed 200000 reads
2021-02-08 13:09:03,453 INFO Parsed 300000 reads
2021-02-08 13:09:07,281 INFO Parsed 400000 reads
2021-02-08 13:09:11,091 INFO Parsed 500000 reads
2021-02-08 13:09:14,893 INFO Parsed 600000 reads
2021-02-08 13:09:18,690 INFO Parsed 700000 reads
2021-02-08 13:09:22,487 INFO Parsed 800000 reads
2021-02-08 13:09:26,317 INFO Parsed 900000 reads
2021-02-08 13:09:30,125 INFO Parsed 1000000 reads
2021-02-08 13:09:33,941 INFO Parsed 1100000 reads
2021-02-08 13:09:37,761 INFO Parsed 1200000 reads
2021-02-08 13:09:41,583 INFO Parsed 1300000 reads
2021-02-08 13:09:45,396 INFO Parsed 1400000 reads
2021-02-08 13:09:49,202 INFO Parsed 1500000 reads
2021-02-08 13:09:53,007 INFO Parsed 1600000 reads
2021-02-08 13:09:56,807 INFO Parsed 1700000 reads
2021-02-08 13:10:00,610 INFO Parsed 1800000 reads
2021-02-08 13:10:04,407 INFO Parsed 1900000 reads
2021-02-08 13:10:08,204 INFO Parsed 2000000 reads
2021-02-08 13:10:12,008 INFO Parsed 2100000 reads
2021-02-08 13:10:15,811 INFO Parsed 2200000 reads
2021-02-08 13:10:19,608 INFO Parsed 2300000 reads
2021-02-08 13:10:23,406 INFO Parsed 2400000 reads
2021-02-08 13:10:27,209 INFO Parsed 2500000 reads
2021-02-08 13:10:31,001 INFO Parsed 2600000 reads
2021-02-08 13:10:34,789 INFO Parsed 2700000 reads
2021-02-08 13:10:38,577 INFO Parsed 2800000 reads
2021-02-08 13:10:42,373 INFO Parsed 2900000 reads
2021-02-08 13:10:46,176 INFO Parsed 3000000 reads
2021-02-08 13:10:49,984 INFO Parsed 3100000 reads
2021-02-08 13:10:53,748 INFO Parsed 3200000 reads
2021-02-08 13:10:57,492 INFO Parsed 3300000 reads
2021-02-08 13:11:01,258 INFO Parsed 3400000 reads
2021-02-08 13:11:05,028 INFO Parsed 3500000 reads
2021-02-08 13:11:08,803 INFO Parsed 3600000 reads
2021-02-08 13:11:12,587 INFO Parsed 3700000 reads
2021-02-08 13:11:16,385 INFO Parsed 3800000 reads
2021-02-08 13:11:20,190 INFO Parsed 3900000 reads
2021-02-08 13:11:23,994 INFO Parsed 4000000 reads
2021-02-08 13:11:27,813 INFO Parsed 4100000 reads
2021-02-08 13:11:31,618 INFO Parsed 4200000 reads
2021-02-08 13:11:35,429 INFO Parsed 4300000 reads
2021-02-08 13:11:39,251 INFO Parsed 4400000 reads
2021-02-08 13:11:43,068 INFO Parsed 4500000 reads
2021-02-08 13:11:46,889 INFO Parsed 4600000 reads
2021-02-08 13:11:50,709 INFO Parsed 4700000 reads
2021-02-08 13:11:54,523 INFO Parsed 4800000 reads
2021-02-08 13:11:58,338 INFO Parsed 4900000 reads
2021-02-08 13:12:02,164 INFO Parsed 5000000 reads
2021-02-08 13:12:05,980 INFO Parsed 5100000 reads
2021-02-08 13:12:09,793 INFO Parsed 5200000 reads
2021-02-08 13:12:13,606 INFO Parsed 5300000 reads
2021-02-08 13:12:17,414 INFO Parsed 5400000 reads
2021-02-08 13:12:21,227 INFO Parsed 5500000 reads
2021-02-08 13:12:25,031 INFO Parsed 5600000 reads
2021-02-08 13:12:28,838 INFO Parsed 5700000 reads
2021-02-08 13:12:32,643 INFO Parsed 5800000 reads
2021-02-08 13:12:36,449 INFO Parsed 5900000 reads
2021-02-08 13:12:40,260 INFO Parsed 6000000 reads
2021-02-08 13:12:44,061 INFO Parsed 6100000 reads
2021-02-08 13:12:47,881 INFO Parsed 6200000 reads
2021-02-08 13:12:51,633 INFO Parsed 6300000 reads
2021-02-08 13:12:55,385 INFO Parsed 6400000 reads
2021-02-08 13:12:59,148 INFO Parsed 6500000 reads
2021-02-08 13:13:02,917 INFO Parsed 6600000 reads
Traceback (most recent call last):
  File "/home/miniconda3/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/home/miniconda3/lib/python3.7/site-packages/umi_tools/umi_tools.py", line 61, in main
    module.main(sys.argv)
  File "/home/miniconda3/lib/python3.7/site-packages/umi_tools/extract.py", line 397, in main
    read1s, read2s, strict):
  File "/home/miniconda3/lib/python3.7/site-packages/umi_tools/umi_methods.py", line 115, in joinedFastqIterate
    for read1 in fastq_iterator1:
  File "/home/miniconda3/lib/python3.7/site-packages/umi_tools/umi_methods.py", line 79, in fastqIterate
    line1 = convert2string(infile.readline())
  File "/home/miniconda3/lib/python3.7/gzip.py", line 289, in read1
    return self._buffer.read1(size)
  File "/home/miniconda3/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/miniconda3/lib/python3.7/gzip.py", line 482, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

IanSudbery commented 3 years ago

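This looks to me like a problem with the input files: the EOFError is raised while reading the gzipped input, which suggests that at least one of the files is truncated.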

Do you have an MD5 hash for the files you can check against?
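
If a checksum was supplied with the data, something like the following could be used to compute the MD5 of each file for comparison. This is only a minimal sketch: `md5_of_file` is a hypothetical helper name, not part of UMI-tools, and the expected hash would have to come from whoever provided the files.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Hypothetical helper for illustration; not part of UMI-tools.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large FASTQ files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_of_file("./raw_reads/TSP1_exopancreas1_3_S21_L003_R1_001.fastq.gz")` and comparing the result to the published hash would show whether the file was damaged in transfer.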

Also, what is the output of

$ gzip -t -v ./raw_reads/TSP1_exopancreas1_3_S21_L003_R1_001.fastq.gz 

and

$ gzip -t -v ./raw_reads/TSP1_exopancreas1_3_S21_L003_R2_001.fastq.gz 
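
If it is easier to check from Python, roughly the same test can be sketched by decompressing each file to the end and catching the error seen in the traceback. This is an assumption-laden sketch rather than anything UMI-tools provides: `gzip_is_intact` is a hypothetical helper name, and it only reports whether the file decompresses cleanly, not whether the FASTQ content is valid.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip file decompresses without error.

    Hypothetical helper for illustration; not part of UMI-tools.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end; a truncated file raises EOFError on the way.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: stream ended early; OSError/zlib.error: not gzip or corrupt data.
        return False
    return True
```

A `False` result for either the R1 or the R2 file would point to the same truncation that `gzip -t` reports.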

gitamahm commented 3 years ago

This was indeed the problem: running those commands gives an "unexpected end of file" error. Thanks!
