MiraldiLab / maxATAC

Transcription Factor Binding Prediction from ATAC-seq and scATAC-seq with Deep Neural Networks
Apache License 2.0

`maxatac prepare` issue processing scATAC-seq fragment files #97

Closed by tacazares 2 years ago

tacazares commented 2 years ago

@anthonybejjani ran into an error when preparing scATAC-seq fragment files with `maxatac prepare`. I was able to reproduce it; the full log and traceback are:

                             _______       _____ 
                          /\|__   __|/\   / ____|
 _ __ ___   __ ___  __   /  \  | |  /  \ | |     
| '_ ` _ \ / _` \ \/ /  / /\ \ | | / /\ \| |     
| | | | | | (_| |>  <  / ____ \| |/ ____ \ |____ 
|_| |_| |_|\__,_/_/\_\/_/    \_\_/_/    \_\_____|

[2022-05-27 12:54:29,022]
Input file: /Users/caz3so/scratch/20220525_maxatac_scatac_subset/GM12878_scATAC_10k_fragments.tsv.gz 
Input chromosome sizes file: /Users/caz3so/opt/maxatac/data/hg38/hg38.chrom.sizes 
Tn5 cut sites will be slopped 20 bps on each side 
Input blacklist file: /Users/caz3so/opt/maxatac/data/hg38/hg38_maxatac_blacklist.bw 
Output filename: GM12878_scatac_10k 
Output directory: /Users/caz3so/scratch/20220525_maxatac_scatac_subset 
Using a millions factor of: 20000000 
Using 9 threads to run job.
[2022-05-27 12:54:29,023]
Generate the normalized signal tracks.
[2022-05-27 12:54:29,023]
Working on 10X scATAC fragments file 
 Converting fragment files to Tn5 sites
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1113, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('int32') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/caz3so/workspaces/miraldiLab/maxATAC/maxatac/bin/maxatac", line 24, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/Users/caz3so/workspaces/miraldiLab/maxATAC/maxatac/bin/maxatac", line 20, in main
    args.func(args)
  File "/Users/caz3so/workspaces/miraldiLab/maxATAC/maxatac/analyses/prepare.py", line 93, in run_prepare
    bed_df = convert_fragments_to_tn5_bed(args.input, ALL_CHRS)
  File "/Users/caz3so/workspaces/miraldiLab/maxATAC/maxatac/utilities/prepare_tools.py", line 25, in convert_fragments_to_tn5_bed
    df = pd.read_table(fragments_tsv,
  File "/Users/caz3so/opt/anaconda3/envs/maxatac/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/Users/caz3so/opt/anaconda3/envs/maxatac/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 779, in read_table
    return _read(filepath_or_buffer, kwds)
  File "/Users/caz3so/opt/anaconda3/envs/maxatac/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 581, in _read
    return parser.read(nrows)
  File "/Users/caz3so/opt/anaconda3/envs/maxatac/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1254, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/Users/caz3so/opt/anaconda3/envs/maxatac/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 787, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 883, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1026, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1119, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: invalid literal for int() with base 10: 'GAGATTCCAAAGGTCG-1'

I reproduced the error on my Mac. The problem is in `convert_fragments_to_tn5_bed`, the function that converts the fragments to Tn5 sites, specifically in the call that reads the fragments file. I had originally written it to set the column names and data types (`col_types`) explicitly, to try to save memory and speed up reading the text file:

    col_types = {
                 "chr": "category", 
                 "start": "int32", 
                 "stop": "int32", 
                 "barcode": "category", 
                 "support": "int32"
                }

    # Import fragments tsv as a dataframe
    df = pd.read_table(fragments_tsv,
                       header=None,
                       names=["chr", "start", "stop", "barcode", "support"],
                       dtype=col_types, 
                       low_memory=False)
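
One way to check which column holds non-integer values before forcing any dtypes is to read a few rows as plain strings. This is only a minimal sketch: `peek_fragments` is a hypothetical helper (not part of maxATAC), the second barcode in the example data is made up, and the `comment="#"` argument assumes that any header lines in the fragments file start with `#`.

```python
import io

import pandas as pd


def peek_fragments(fragments_tsv, nrows=5):
    """Hypothetical helper: read the first rows as strings so no dtype cast can fail."""
    return pd.read_table(
        fragments_tsv,
        sep="\t",
        header=None,   # fragment files have no column header row
        dtype=str,     # keep every field as text; nothing is cast to int
        comment="#",   # assumption: header lines, if any, start with '#'
        nrows=nrows,
    )


# Tiny in-memory stand-in for a fragments file (illustrative values only)
example = io.StringIO(
    "chr1\t10000\t10250\tGAGATTCCAAAGGTCG-1\t2\n"
    "chr1\t10100\t10400\tTTTGGCCAGTACGATA-1\t1\n"
)

preview = peek_fragments(example)
print("columns found:", preview.shape[1])
print(preview)
```

Comparing the number of columns and their contents against the `names` list shows whether a barcode ends up in a column that is being cast to `int32`.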

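I tested reading in the file without setting the data types and the code worked. The data types are not necessary for this function, and forcing them is what triggers the failure: the traceback shows pandas trying to parse the barcode string 'GAGATTCCAAAGGTCG-1' as an integer. I fixed the import by dropping `dtype` and reading only the first four columns: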

    # Import fragments tsv as a dataframe
    df = pd.read_table(fragments_tsv,
                       sep="\t",
                       header=None,
                       usecols=[0,1,2,3],
                       names=["chr", "start", "stop", "barcode"]
                       )
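
If the memory savings from the original dtypes are still wanted, one option is to cast after the read instead of during it. The sketch below is an assumption-laden illustration, not the code in maxATAC: `read_fragments` is a hypothetical name, the example rows are made up apart from the barcode taken from the traceback, and the cast will still raise if the columns are misaligned. In that case the error happens after the file has been parsed, so the DataFrame is available for inspection.

```python
import io

import pandas as pd


def read_fragments(fragments_tsv):
    """Hypothetical wrapper: read the first four columns, then downcast."""
    df = pd.read_table(
        fragments_tsv,
        sep="\t",
        header=None,
        usecols=[0, 1, 2, 3],  # ignore the fifth (support) column
        names=["chr", "start", "stop", "barcode"],
    )
    # Cast after parsing; a failure here points at a specific column
    return df.astype(
        {"chr": "category", "start": "int32", "stop": "int32", "barcode": "category"}
    )


# Tiny in-memory stand-in for a fragments file (illustrative values only)
example = io.StringIO(
    "chr1\t10000\t10250\tGAGATTCCAAAGGTCG-1\t2\n"
    "chr1\t10100\t10400\tGAGATTCCAAAGGTCG-1\t1\n"
)

df = read_fragments(example)
print(df.dtypes)
```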