@jayluo2 Hi Jay,
Since the data is pretty large, I would suggest you test your script on a small group of cells, or save the bed files at the subclass (or L3/L4) level and merge them later.
Songpeng
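For the merging step suggested above, here is a minimal sketch in Python; it assumes the per-group bed.gz files can simply be concatenated (gzip streams remain valid when concatenated byte-for-byte), and the file names are hypothetical:
import shutil
# Hypothetical per-subclass files to be merged into one class-level file.
parts = ['/path/to/HY_GABA_subclass1.bed.gz', '/path/to/HY_GABA_subclass2.bed.gz']
# Concatenating gzip files byte-for-byte yields a valid multi-member
# gzip stream, which zcat and Python's gzip module both accept.
with open('/path/to/HY_GABA.bed.gz', 'wb') as out:
    for part in parts:
        with open(part, 'rb') as fh:
            shutil.copyfileobj(fh, out)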
@jayluo2 This is how I export bed files for peak calling: https://github.com/beyondpie/CEMBA_wmb_snATAC/blob/main/03.peakcalling/src/main/python/prepare_bedfiles.py I use a Snakefile to organize this, so that multiple groups can be run simultaneously.
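A minimal sketch of that per-group pattern, assuming snapatac2.read_dataset reopens a saved AnnDataSet and that the selections parameter of export_bed (visible in its signature later in this thread) restricts output to the named groups; the dataset path and group name are placeholders:
import snapatac2
# Reopen the saved AnnDataSet (placeholder path).
adataset = snapatac2.read_dataset('/path/to/anndataset_object')
# Export fragments for a single group; one such call per group can be
# dispatched as an independent job (e.g., one Snakemake rule per group).
snapatac2.ex.export_bed(
    adataset,
    groupby='Subclass',
    selections=['HY_GABA'],  # assumed: restricts output to these groups
    out_dir='/path/to/output/fragments',
    suffix='.bed.gz',
)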
The file opens fine with snapatac2.read(), and the contents appear normal:
AnnData object with n_obs × n_vars = 7350 × 5451055
obs: 'tsse', 'n_fragment', 'frac_dup', 'frac_mito', 'doublet_probability', 'doublet_score'
var: 'count', 'selected'
uns: 'reference_sequences', 'scrublet_sim_doublet_score'
obsm: 'insertion'
anndata.X
yields
<7350x5451055 sparse matrix of type '<class 'numpy.uint32'>'
with 56624600 stored elements in Compressed Sparse Row format>
and the .obs slot looks fine as well.
@beyondpie Accessing .obsm['insertion'] on this h5ad file (CEMBA180129_3A_rm_dlt.h5ad) returns
<7350x2725521370 sparse matrix of type '<class 'numpy.uint8'>'
with 95603440 stored elements in Compressed Sparse Row format>
Using export_bed() with a column of 'mock labels' (alphabetical order) works just fine as well, with the following command ("anndata" = the same h5ad file):
snapatac2.ex.export_bed(anndata,
groupby='mock_labels',
out_dir=output_dir,
suffix='.bed.gz')
The fragment files also appear to be fine, e.g.
zcat a.bed.gz | head
chr1 4186842 4186843 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1 4187029 4187030 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1 4187030 4187031 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1 4187283 4187284 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1 5070349 5070350 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1 5070467 5070468 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1 7138690 7138691 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1 7138914 7138915 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1 7378213 7378214 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1 7378408 7378409 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
and
zcat u.bed.gz | head
chr1 3588059 3588060 TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1 3588103 3588104 TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1 5017916 5017917 TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1 5017979 5017980 TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1 7397876 7397877 TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1 7397997 7397998 TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1 22296964 22296965 TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1 22297030 22297031 TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1 24611574 24611575 TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1 24611626 24611627 TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
@beyondpie Regarding your second question, I identified some more discrepancies similar to HY GABA above:
Discrepancy for HY_GABA: 205488208 fragments (metadata), but class-level bed.gz file has 226103206 rows
Discrepancy for CTX-CGE_GABA: 233809950 fragments (metadata), but class-level bed.gz file has 290152494 rows
Discrepancy for CTX-MGE_GABA: 443671252 fragments (metadata), but class-level bed.gz file has 550481310 rows
Discrepancy for MH-LH_Glut: 34059796 fragments (metadata), but class-level bed.gz file has 34759628 rows
Discrepancy for NP-CT-L6b_Glut: 784204001 fragments (metadata), but class-level bed.gz file has 996287794 rows
Discrepancy for Astro-Epen: 1147969857 fragments (metadata), but class-level bed.gz file has 1258567744 rows
Discrepancy for CNU-HYa_Glut: 143951176 fragments (metadata), but class-level bed.gz file has 166911254 rows
Discrepancy for MB_Glut: 750316389 fragments (metadata), but class-level bed.gz file has 815605023 rows
Discrepancy for HY_Glut: 161648292 fragments (metadata), but class-level bed.gz file has 177754869 rows
Discrepancy for TH_Glut: 356853925 fragments (metadata), but class-level bed.gz file has 455732012 rows
Discrepancy for P_GABA: 126838072 fragments (metadata), but class-level bed.gz file has 133969424 rows
Discrepancy for Immune: 210318453 fragments (metadata), but class-level bed.gz file has 234100458 rows
Discrepancy for Vascular: 237619853 fragments (metadata), but class-level bed.gz file has 262067892 rows
Discrepancy for CB_Glut: 427319942 fragments (metadata), but class-level bed.gz file has 476247064 rows
Discrepancy for OPC-Oligo: 1837739725 fragments (metadata), but class-level bed.gz file has 1912232980 rows
Discrepancy for IT-ET_Glut: 3911482310 fragments (metadata), but class-level bed.gz file has 4824967792 rows
Discrepancy for CNU-HYa_GABA: 238441380 fragments (metadata), but class-level bed.gz file has 285071090 rows
Discrepancy for CB_GABA: 50524434 fragments (metadata), but class-level bed.gz file has 57209872 rows
Discrepancy for MB_GABA: 448868487 fragments (metadata), but class-level bed.gz file has 572258267 rows
These file line counts are, of course, as of the time of the error.
I agree with you and think the source of these discrepancies must be in how the "# of Fragments" column in Supplementary Table 2 is generated. Do you have thoughts on which of the two (i.e. Supplementary Table 2 or the bed.gz line counts from export_bed()) is the more filtered/up-to-date version?
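For reference, a sketch of how such a comparison can be made; the paths and the metadata column names ('Class', '# of Fragments') are assumptions based on the discussion above, and the per-class file naming is hypothetical:
import gzip
import os
import pandas as pd
metadata = pd.read_table('/path/to/Supplementary_Table_2.txt')  # placeholder path
# Sum fragment counts per class; column names are assumptions.
expected = metadata.groupby('Class')['# of Fragments'].sum()
bed_dir = '/path/to/class_level_fragments'
for cls, n_expected in expected.items():
    path = os.path.join(bed_dir, str(cls).replace(' ', '_') + '.bed.gz')
    if not os.path.exists(path):
        continue
    # Count rows in the gzipped bed file (equivalent to `zcat file | wc -l`).
    with gzip.open(path, 'rt') as fh:
        n_rows = sum(1 for _ in fh)
    if n_rows != n_expected:
        print(f'Discrepancy for {cls}: {n_expected} fragments (metadata), '
              f'but class-level bed.gz file has {n_rows} rows')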
@jayluo2 Thank you so much for the feedback!
I will keep this open and see if @wkl1990 can give us some feedback later. For now, I suggest you use the files you generated and temporarily ignore the inconsistencies between yours and the numbers in Supplementary Table 2.
Sincerely, Songpeng
@beyondpie Thank you for the details! Unfortunately, the errors (PanicException) do not allow the fragments to be written to completion. I'm not sure if this error is related to the fact that the Tn5 insertions in the h5ad files are not filtered for the mm10 blacklist.
Best, Jay
@jayluo2
Best, Songpeng
@beyondpie Thank you for clarifying! I tried grouping by subclass with the same command,
snapatac2.ex.export_bed(adataset,
groupby='Subclass',
suffix='.bed.gz',
out_dir = output_dir
)
but received the same error:
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: H5Dread(): can't read data: file read failed: time = Wed Dec 20 16:47:56 2023
, filename = '/path/to/CEMBA181022_6B_rm_dlt.h5ad', file descriptor = 142, errno = 5, error message = 'Input/output error', buf = 0x62c1890f, total read size = 33416, bytes this sub-read = 33416, bytes actually read = 18446744073709551615, offset = 0', /root/.cargo/git/checkouts/anndata-rs-087569d311f7caf2/d54b207/anndata/src/container/base.rs:1160:85
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
Cell In[40], line 5
3 print(f'Writing outputs to f{subclass_level_fragments_dir}')
4 print('='*50)
----> 5 snapatac2.ex.export_bed(adataset,
6 groupby='Subclass',
7 suffix='.bed.gz',
8 out_dir = subclass_level_fragments_dir
9 )
File ~/miniconda3/envs/conda_env/lib/python3.8/site-packages/snapatac2/export/__init__.py:55, in export_bed(adata, groupby, selections, ids, out_dir, prefix, suffix)
52 elif isinstance(ids, str):
53 ids = adata.obs[ids]
---> 55 return internal.export_bed(
56 adata, list(ids), list(groupby), selections, out_dir, prefix, suffix,
57 )
PanicException: called `Result::unwrap()` on an `Err` value: H5Dread(): can't read data: file read failed: time = Wed Dec 20 16:47:56 2023
, filename = '/path/to/CEMBA181022_6B_rm_dlt.h5ad', file descriptor = 142, errno = 5, error message = 'Input/output error', buf = 0x62c1890f, total read size = 33416, bytes this sub-read = 33416, bytes actually read = 18446744073709551615, offset = 0
I am not sure if this is due to any particular h5ad file. Would it be possible to upload subclass-level fragment bed files to catlas.org?
Best, Jay
@jayluo2 Sorry to hear that you still have the errors.
Songpeng
@beyondpie Apologies for the delay in uploading bed files. I will upload them to Catlas and notify you once it's completed.
@beyondpie @wkl1990 Thank you both for your help, and that all sounds great!
@beyondpie Below is the code:
import os
import numpy as np
import pandas as pd
import snapatac2
# paths
cell_level_metadata_txt_path = os.path.join('path/to/Supplementary_Table_2.txt')
subclass_class_mapping_metadata_txt_path = os.path.join('path/to/AIT21_annotation_freeze_081523.tsv')
raw_anndata_dir = '/path/to/raw_anndata_files'
anndataset_path = '/path/to/save/anndataset_object'  # renamed from anndataset_dir so the AnnDataSet call below resolves
output_dir = '/path/to/output/fragments'
# Load metadata (all 2.3M cells)
whole_mouse_brain_2023_metadata_df = pd.read_table(cell_level_metadata_txt_path)
# Get a list of absolute paths to sample h5ad files
# (this filename list was not defined in the original snippet; reconstructed here)
raw_h5ad_filenames_l = [f for f in os.listdir(raw_anndata_dir) if f.endswith('_rm_dlt.h5ad')]
adata_paths_l = [os.path.join(raw_anndata_dir, filename) for filename in raw_h5ad_filenames_l]
adatas_l = [(path.split('/')[-1].split('_rm_dlt.h5ad')[0], path) for path in adata_paths_l]
# Get AnnDataSet object
adataset = snapatac2.AnnDataSet(
    adatas=adatas_l,
    filename=anndataset_path
)
# Ensure 'obs_names' are unique
# use "." delimiter because that's what's done w/ cell-level metadata
unique_cell_ids = [sa + '.' + bc for sa, bc in zip(adataset.obs['sample'], adataset.obs_names)]
adataset.obs_names = unique_cell_ids
assert adataset.n_obs == np.unique(adataset.obs_names).size
# Get cell-to-subclass mapping
cellid_to_subclass_dict = dict(zip(whole_mouse_brain_2023_metadata_df['CellID'], whole_mouse_brain_2023_metadata_df['Subclass']))
# Create a Series from adataset.obs_names
cell_barcodes_series = pd.Series(adataset.obs_names)
# Map the 'Subclass' from 'cellid_to_subclass_dict' to the Series
subclass_labels = cell_barcodes_series.map(cellid_to_subclass_dict)
# Replace all spaces with underscores in the subclass labels
subclass_labels = subclass_labels.str.replace(' ', '_', regex=False)
assert not subclass_labels.isna().any()
# Add the subclass labels as a new column to adataset.obs
adataset.obs['Subclass'] = subclass_labels
# Export subclass-level fragment BED files
snapatac2.ex.export_bed(adataset,
                        groupby='Subclass',
                        suffix='.bed.gz',
                        out_dir=output_dir)
Best, Jay
@jayluo2 Here we also uploaded the hdf5 files (see supplementary file GSE246791_RAW.tar): https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE246791. You can give this a try.
I did not see any apparent bugs in your code. Previously I found that I could not easily write some columns into a SnapATAC2 AnnData/AnnDataSet. The groupby parameter supports an independent list: https://github.com/beyondpie/CEMBA_wmb_snATAC/blob/8d090924b7a75ef6bf12f6a6ee396092d9437559/03.peakcalling/src/main/python/prepare_bedfiles.py#L83
This is how I export the bed files; a sketch of the list-based call follows.
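A minimal sketch, assuming subclass_labels is a per-cell sequence ordered like adataset.obs_names (both names are taken from the code earlier in this thread):
# Pass the per-cell labels directly as a list, avoiding the need to
# write a 'Subclass' column into adataset.obs first.
snapatac2.ex.export_bed(
    adataset,
    groupby=list(subclass_labels),
    out_dir=output_dir,
    suffix='.bed.gz',
)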
Also, your error is about reading the file, yet you can read that particular file correctly when you work on it alone, so I don't know what happened; this looks strange to me. Could you close the hdf5 files (or close your Python REPL) and try your code again? For example, see the sketch below.
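A sketch of that suggestion, assuming AnnDataSet.close() releases the backing hdf5 handles and snapatac2.read_dataset reopens the saved dataset (the path is a placeholder):
# Close the backing hdf5 file(s) to drop possibly stale file descriptors.
adataset.close()
# Reopen the dataset and retry the export.
adataset = snapatac2.read_dataset('/path/to/save/anndataset_object')
snapatac2.ex.export_bed(adataset,
                        groupby='Subclass',
                        suffix='.bed.gz',
                        out_dir=output_dir)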
@wkl1990 Thank you so much! Did you run into any errors with this task before? Or could you share your code with @jayluo2 so he can run it?
Songpeng
@beyondpie @jayluo2 The subclass-level fragment files have been updated on Catlas. You can download them here: http://catlas.org/wholemousebrain/#!/downloads
@wkl1990 @beyondpie Thank you so much! I will proceed with those files for now and close this issue, but would appreciate any updates regarding the h5ad files as well. Thank you both for all your help!
@beyondpie @wkl1990 I see that there are 338 bed.gz fragment files in the uploaded directory. Could you please clarify why there are 275 cCRE files (which appear to be subclass level) but 338 subclass fragment files? I am specifically referring to this part of the paper: "Finally, 275 out of 338 subclasses were annotated to the 1,482 subtypes. This includes 253 out of 315 neuronal subclasses, covering 28 neuronal classes and 7 neurotransmitter types, as well as 22 out of 23 non-neuronal subclasses, covering 5 non-neuronal classes."
My interpretation of this is that only 275 of the 338 scRNA-seq subclasses anchored properly to the snATAC-seq data. Is this correct? In that case, I presume the corresponding fragments files for these 275 subclasses are the most relevant files for downstream analyses?
Best, Jay
@jayluo2 Would you mind checking the number of files? Kangli just added a subclass_id to each file name, so a prefix like "338_blabla" only indicates the index, not the number of files.
@beyondpie I see 275 files as expected — thank you!
Dear CEMBA team,
I am trying to write fragment bed files to disk through an AnnDataSet object (with all 2.3M cells from all .h5ad sample files;
AnnDataSet object with n_obs x n_vars = 2355842 x 5451055
) using SnapATAC2 2.4, as shown below, and encountered the following error (I've replaced my personal paths with /path/to/). 'Class' is a manually added column, using AIT21_annotation_freeze_081523.tsv (under this folder) for class-to-subclass conversion. I have verified that each cell is assigned the appropriate class label. At the time of this error (which occurred after hours of writing to disk), the line counts of the written bed.gz files do not match the number of fragments (grouped by class) in the metadata (Supplementary Table 2 of the paper), as I would have expected.
Counting the number of lines (at the time of the error) using
zcat <filename> | wc -l
and grouping the metadata by class (for all 2.3M cells) yield different totals.
Strangely, I notice that for the HY GABA class, the number of written lines (226103206) exceeds the number of fragments for that class in the metadata (205488208) by ~20M entries, so I'm not sure what is going on there.
I'm also not sure if this is due to the particularly large files, but the error occurred after much of the data had already been written. Would appreciate any thoughts on this!
Best, Jay