colomemaria / epiScanpy

Episcanpy: Epigenomics Single Cell Analysis in Python
BSD 3-Clause "New" or "Revised" License

Problems when generating matrix (episcanpy.ct.bld_mtx_fly()) #100

Open malumbres opened 3 years ago

malumbres commented 3 years ago

Dear Anna, congratulations on this package! I am a big fan of the scanpy environment.

I was wondering whether you have a tutorial for scATAC-seq data from 10x Genomics (or for joint scRNA-seq + scATAC-seq data). Specifically, I get the following error when reading 10x ATAC-seq data:

epi.ct.bld_mtx_fly(tsv_file="atac_fragments.tsv.gz", annotation="atac_peak_annotation.tsv", save="test.h5ad")

ERROR:

loading barcodes

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-12-4093bfdf2ff8> in <module>
      4 filename = P + "test.h5ad"
      5 
----> 6 epi.ct.bld_mtx_fly(tsv_file="atac_fragments.tsv.gz",
      7                    annotation="atac_peak_annotation.tsv",
      8                    save="test.h5ad",

~/opt/anaconda3/lib/python3.8/site-packages/episcanpy/count_matrix/_bld_atac_mtx.py in bld_mtx_fly(tsv_file, annotation, csv_file, genome, save)
     39 
     40         print('loading barcodes')
---> 41         barcodes = sorted(pd.read_csv(tsv_file, sep='\t', header=None).loc[:, 3].unique().tolist())
     42 
     43         # barcodes

~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    603     kwds.update(kwds_defaults)
    604 
--> 605     return _read(filepath_or_buffer, kwds)
    606 
    607 

~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    461 
    462     with parser:
--> 463         return parser.read(nrows)
    464 
    465 

~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in read(self, nrows)
   1050     def read(self, nrows=None):
   1051         nrows = validate_integer("nrows", nrows)
-> 1052         index, columns, col_dict = self._engine.read(nrows)
   1053 
   1054         if index is None:

~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in read(self, nrows)
   2054     def read(self, nrows=None):
   2055         try:
-> 2056             data = self._reader.read(nrows)
   2057         except StopIteration:
   2058             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 53, saw 5

These are lines 52 and 53 in the tsv file:

# primary_contig=JH584295.1
chr1	3000087	3000282	GCCAATTAGCACTAAC-1	1
chr1	3001599	3001786	AAGGTATAGCAGGTGG-1	1

Many thanks! Marcos
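
A possible workaround, sketched under the assumption that the `#`-prefixed header lines at the top of the fragments file are what trips the parser: pandas infers the column count from the first line (a single-field comment line), so the first five-field data row at line 53 raises the `ParserError`. The helper below is hypothetical (it is not part of episcanpy) and uses only the standard library to write a copy of the file without those comment lines; the demo file names and contents are made up from the lines quoted above.

```python
import gzip
import os
import tempfile


def strip_comment_lines(src_path, dst_path, comment_prefix="#"):
    """Copy a gzipped TSV, dropping lines that start with comment_prefix.

    Hypothetical helper, not an episcanpy function.
    Returns the number of lines kept.
    """
    kept = 0
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in src:
            if line.startswith(comment_prefix):
                continue  # skip header/comment lines such as "# primary_contig=..."
            dst.write(line)
            kept += 1
    return kept


# Small demo on a made-up fragments file that mimics the lines quoted above.
tmp_dir = tempfile.mkdtemp()
raw_path = os.path.join(tmp_dir, "atac_fragments.tsv.gz")
clean_path = os.path.join(tmp_dir, "atac_fragments.noheader.tsv.gz")

with gzip.open(raw_path, "wt") as fh:
    fh.write("# primary_contig=JH584295.1\n")
    fh.write("chr1\t3000087\t3000282\tGCCAATTAGCACTAAC-1\t1\n")
    fh.write("chr1\t3001599\t3001786\tAAGGTATAGCAGGTGG-1\t1\n")

n_kept = strip_comment_lines(raw_path, clean_path)

# Read the cleaned copy back and collect the barcode column (index 3),
# which is the column the failing line in the traceback tries to extract.
with gzip.open(clean_path, "rt") as fh:
    rows = [line.rstrip("\n").split("\t") for line in fh]

barcodes = sorted({row[3] for row in rows})
print(n_kept, barcodes)
```

The cleaned file could then be passed as `tsv_file` to `epi.ct.bld_mtx_fly`; whether the rest of that function accepts it has not been verified here. If the `read_csv` call itself can be changed, pandas also offers a `comment="#"` argument that skips such lines.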