UcarLab / AMULET

A count based method for detecting doublets from single nucleus ATAC-seq (snATAC-seq) data.
https://ucarlab.github.io/AMULET/
GNU General Public License v3.0
29 stars 5 forks source link

Multiome input #11

Open tinakeshav opened 2 years ago

tinakeshav commented 2 years ago

Hi AMULET team, I have a 10X multiome kit dataset and trying to run AMULET on the ATAC data. The fragments input file is the same, but the singlecell.csv is no longer available as CellRanger output. Instead, the following per_barcode_metrics.csv is produced.

barcode,gex_barcode,atac_barcode,is_cell,excluded_reason,gex_raw_reads,gex_mapped_reads,gex_conf_intergenic_reads,gex_conf_exonic_reads,gex_conf_intronic_reads,gex_conf_exonic_unique_reads,gex_conf_exonic_antisense_reads,gex_conf_exonic_dup_reads,gex_exonic_umis,gex_conf_intronic_unique_reads,gex_conf_intronic_antisense_reads,gex_conf_intronic_dup_reads,gex_intronic_umis,gex_conf_txomic_unique_reads,gex_umis_count,gex_genes_count,atac_raw_reads,atac_unmapped_reads,atac_lowmapq,atac_dup_reads,atac_chimeric_reads,atac_mitochondrial_reads,atac_fragments,atac_TSS_fragments,atac_peak_region_fragments,atac_peak_region_cutsites
AAACAGCCAAACAACA-1,AAACAGCCAAACAACA-1,ACAGCGGGTGTGTTAC-1,0,0,12,12,0,4,7,4,0,2,2,0,7,0,0,4,0,0,2,0,0,0,0,0,2,1,1,2 

The is_cell column corresponds to the is__cell_barcode from singlecell.csv I believe. I've tried a quick and dirty solution of renaming the column, which failed. The atac_barcode column corresponds to the barcode column from singlecell.csv .

Is there a good way to modify the source code accordingly to take care of this? Thanks so much, Tina

ajt986 commented 2 years ago

Hi Tina,

If you are working with the fragment file, then the python code is looking for "is__cell_barcode" and "barcode" columns in the file. So if you rename/generate a CSV with these columns it should be able to process the fragment file. Also note that the is__cell_barcode is looking for values == 1. Alternatively you can modify these in the python code (FragmentFileOverlapCounter.py):

line 231: sc_data = sc_data[sc_data['is__cell_barcode'] == 1]

line 239: for curbarcode in sc_data['barcode']:

If using the alignments (.bam), then you need to specify the correct index of these columns using the --bcidx, --cellidx, and --iscellidx parameters.

Best, Asa

tinakeshav commented 2 years ago

Hello, I've fixed those and I'm getting the error pasted below. The Overlaps.txt file is empty (see below). Any guidance on how to navigate this?

(base) [tkeshava@node59 test-am]$ cat Overlaps.txt
chr     start   end     cell id Min Overlap Count       Max Overlap Count       Mean Mapping Quality    Min Mapping Quality     Max Mapping Quality     Starts  Ends

Accordingly, the OverlapsSummary.txt looks like the following

Cell Id Number of Valid Reads   Number of Overlaps      Barcode Total Number of Reads
AAACAGCCAGAACCGA-1      0       0       AAACAGCCAGAACCGA-1      8404

Error message:

/cluster/home/tkeshava/FragmentFileOverlapCounter.py:193: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  table = np.empty(shape=(len(overlapcounts),len(colnames)), dtype=np.object)
/cluster/home/tkeshava/miniconda3/envs/ATACseq/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3440: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/cluster/home/tkeshava/miniconda3/envs/ATACseq/lib/python3.8/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "/cluster/home/tkeshava/FragmentFileOverlapCounter.py", line 354, in <module>
    findOverlaps(fragmentfile, singlecellfile, expectedoverlap, chromosomes, outdir, maxinsert)
  File "/cluster/home/tkeshava/FragmentFileOverlapCounter.py", line 352, in findOverlaps
    print("Max overlaps in segment: "+str(np.max(stats_overlapsizes)))
  File "<__array_function__ internals>", line 5, in amax
  File "/cluster/home/tkeshava/miniconda3/envs/ATACseq/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2754, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File "/cluster/home/tkeshava/miniconda3/envs/ATACseq/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity
Traceback (most recent call last):
  File "/cluster/home/tkeshava/AMULET.py", line 201, in <module>
    lengths = filtereddata[:,2]-filtereddata[:,1]+1
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
ajt986 commented 2 years ago

Seeing the number of valid reads as 0 when using the fragment file overlap counter script tells me that either:

1) The insert sizes of the paired-end fragments are > 900 (default) for all of the reads or, 2) The chromosomes are not matching the chromosomes in the file provided as input

These are the only criteria used to count "valid reads" when using the fragment file which has already filtered alignments for various flags. Double checking the chromosome names may solve the issue. For example, the human_autosome.txt file provided as default list the chromosomes as chr1 chr2 etc, while the fragment file may have just 1, 2 etc. If this is the case, you will just need to create a similar file with the correct chromosome formatting.

Alternatively, you can comment out these two lines in FragmentFileOverlapCounter.py:

        if curchr not in chrlist:
            continue #skip reads from chromosomes not in the provided chromosome list

by putting a # in front of both. In this case, there is no filtering based on chromosomes at all.

The valid read count should be close to the total read count. Hope this helps!

bjstewart1 commented 2 years ago

I encountered this problem with running AMULET on multiome. The barcode column matching the barcodes that AMULET is trying to read in atac_fragments.tsv.gz in per_barcode_metrics.csv is NOT "atac_barcode", it is "barcode". So do this over your 10X multiome cellranger outputs:

for(f in files){
    message(f)
    bc_metrics = read.csv(file.path("path", "to", f, "per_barcode_metrics.csv")) #read in the per_barcode_metrics.csv
    df = data.frame("barcode" = bc_metrics$barcode, "is__cell_barcode" =  bc_metrics$is_cell) #make a dataframe that can be used as "single_cell.csv"
    write.csv(df, file.path("path", "to", f, "single_cell.csv"), row.names = FALSE) #write out the dataframe as "single_cell.csv"
    }

Now run AMULET using single_cell.csv

ajt986 commented 2 years ago

I'm curious about using barcode instead of atac_barcode here. I'm checking the Cell Ranger Arc documentation and when I check:

https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/pipelines/latest/output/per_barcode_metrics

Barcode is the same as GEX barcode, while atac_barcode is the ATAC library

The fragment file barcode corresponds to the BAM file barcode: https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/pipelines/latest/output/fragments

I would imagine that the barcode in the bam file corresponds to the atac_barcode based on when the BAM is generated: https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/pipelines/latest/algorithms/overview

I'm just trying to understand this myself since this is important to get correct.

kwcurrin commented 2 years ago

Hi Asa,

10X could make this more clear, but the barcode in the ATAC bam file is the "gex" barcode. They translate the ATAC barcode to the matching gex barcode before adding it to the ATAC bam: https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/pipelines/latest/output/bam-atac#bam-barcode-translation

Thanks,

Kevin

ollieeknight commented 1 year ago

I encountered this problem with running AMULET on multiome. The barcode column matching the barcodes that AMULET is trying to read in atac_fragments.tsv.gz in per_barcode_metrics.csv is NOT "atac_barcode", it is "barcode". So do this over your 10X multiome cellranger outputs:

for(f in files){
    message(f)
    bc_metrics = read.csv(file.path("path", "to", f, "per_barcode_metrics.csv")) #read in the per_barcode_metrics.csv
    df = data.frame("barcode" = bc_metrics$barcode, "is__cell_barcode" =  bc_metrics$is_cell) #make a dataframe that can be used as "single_cell.csv"
    write.csv(df, file.path("path", "to", f, "single_cell.csv"), row.names = FALSE) #write out the dataframe as "single_cell.csv"
    }

Now run AMULET using single_cell.csv

After using cellranger-arc-2.0.2 to process my multiome data, I needed to perform these steps in R and also add quote = FALSE in the last write.csv() step.

After then quitting R, my amulet command line script looked like:

bash AMULET.sh --forcesorted --iscellidx 1 --bcidx 0 $outs/multiome/$sample/outs/atac_possorted_bam.bam $outs/multiome/$sample/outs/single_cell.csv human_autosomes.txt RestrictionRepeatLists/restrictionlist_repeats_segdups_rmsk_hg38.bed $outs/multiome/$sample/outs/amulet/ ../amulet/ 
alexlenail commented 1 year ago

I think you could also just

cat per_barcode_metrics.csv | cut -d, -f1,4 | sed "1s/.*/barcode,is__cell_barcode/" > singlecell.csv