beyondpie / CEMBA_wmb_snATAC

Whole mouse brain snATAC seq analysis
MIT License

PanicException: called `Result::unwrap()` on an `Err` value: H5Dread(): can't read data: file read failed #8

Closed: jayluo2 closed this issue 9 months ago

jayluo2 commented 9 months ago

Dear CEMBA team,

I am trying to write fragment bed files to disk through an AnnDataSet object (with all 2.3M cells from all .h5ad sample files; the AnnDataSet has n_obs x n_vars = 2355842 x 5451055) using SnapATAC2 2.4, as shown below. 'Class' is a manually added column, built with AIT21_annotation_freeze_081523.tsv (under this folder) for subclass-to-class conversion.

snapatac2.ex.export_bed(adataset,
                        groupby='Class',
                        suffix='.bed.gz',
                        out_dir=class_level_fragments_dir)

and encountered the following error (I’ve replaced my personal paths with /path/to/):

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: H5Dread(): can't read data: file read failed: time = Sun Dec 17 07:33:29 2023
, filename = '/path/to/CEMBA180129_3A_rm_dlt.h5ad', file descriptor = 116, errno = 5, error message = 'Input/output error', buf = 0x50871390, total read size = 2096, bytes this sub-read = 2096, bytes actually read = 18446744073709551615, offset = 0', /root/.cargo/git/checkouts/anndata-rs-087569d311f7caf2/d54b207/anndata/src/container/base.rs:1160:85
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[51], line 5
      3 print(f'Writing outputs to f{class_level_fragments_dir}')
      4 print('='*50)
----> 5 snapatac2.ex.export_bed(adataset,
      6                        groupby='Class',
      7                        suffix='.bed.gz',
      8                        out_dir = class_level_fragments_dir
      9                        )

File ~/miniconda3/envs/conda_env/lib/python3.8/site-packages/snapatac2/export/__init__.py:55, in export_bed(adata, groupby, selections, ids, out_dir, prefix, suffix)
     52 elif isinstance(ids, str):
     53     ids = adata.obs[ids]
---> 55 return internal.export_bed(
     56     adata, list(ids), list(groupby), selections, out_dir, prefix, suffix,
     57 )

PanicException: called `Result::unwrap()` on an `Err` value: H5Dread(): can't read data: file read failed: time = Sun Dec 17 07:33:29 2023
, filename = '/path/to/CEMBA180129_3A_rm_dlt.h5ad', file descriptor = 116, errno = 5, error message = 'Input/output error', buf = 0x50871390, total read size = 2096, bytes this sub-read = 2096, bytes actually read = 18446744073709551615, offset = 0
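For reference, the 'Class' column was built roughly as follows (a sketch, not the exact code; the AIT21 TSV column names 'subclass_label' and 'class_label' are assumptions on my side):

import pandas as pd

# Map each cell to its class via its subclass, using the AIT21 annotation
# table (column names here are assumptions; adjust to the actual TSV header).
meta = pd.read_table('/path/to/Supplementary_Table_2.txt')
ait21 = pd.read_table('/path/to/AIT21_annotation_freeze_081523.tsv')
subclass_to_class = dict(zip(ait21['subclass_label'], ait21['class_label']))

cellid_to_class = {cid: subclass_to_class[sc]
                   for cid, sc in zip(meta['CellID'], meta['Subclass'])}
class_labels = pd.Series(adataset.obs_names).map(cellid_to_class)

# Spaces become underscores so the labels are safe as output file names.
adataset.obs['Class'] = class_labels.str.replace(' ', '_', regex=False)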

I have verified that each cell is assigned the appropriate class label. At the time of this error (which occurred after hours of writing to disk), the line counts of the written bed.gz files do not match the fragment counts per class in the metadata (Supplementary Table 2 of the paper), which is expected since the export did not finish.

Counting the number of lines (at the time of the error) using zcat <filename> | wc -l yields:

{'HY_GABA.bed.gz': 226103206,
 'CTX-CGE_GABA.bed.gz': 290152494,
 'CTX-MGE_GABA.bed.gz': 550481310,
 'MH-LH_Glut.bed.gz': 34759628,
 'NP-CT-L6b_Glut.bed.gz': 996287794,
 'MY_Glut.bed.gz': 79333970,
 'Astro-Epen.bed.gz': 1258567744,
 'CNU-HYa_Glut.bed.gz': 166911254,
 'LSX_GABA.bed.gz': 147192400,
 'DG-IMN_Glut.bed.gz': 336481046,
 'HY_Gnrh1_Glut.bed.gz': 783042,
 'OB-CR_Glut.bed.gz': 6790320,
 'CNU-MGE_GABA.bed.gz': 93358728,
 'MB_Glut.bed.gz': 815605023,
 'MY_GABA.bed.gz': 113820670,
 'HY_Glut.bed.gz': 177754869,
 'TH_Glut.bed.gz': 455732012,
 'CNU-LGE_GABA.bed.gz': 403055016,
 'P_GABA.bed.gz': 133969424,
 'Immune.bed.gz': 234100458,
 'Vascular.bed.gz': 262067892,
 'CB_Glut.bed.gz': 476247064,
 'OPC-Oligo.bed.gz': 1912232980,
 'IT-ET_Glut.bed.gz': 4824967792,
 'CNU-HYa_GABA.bed.gz': 285071090,
 'CB_GABA.bed.gz': 57209872,
 'MB_Dopa.bed.gz': 50937284,
 'P_Glut.bed.gz': 233308761,
 'MB_GABA.bed.gz': 572258267,
 'OB-IMN_GABA.bed.gz': 50274352,
 'MB-HB_Sero.bed.gz': 25413152,
 'OEC.bed.gz': 509394,
 'Pineal_Glut.bed.gz': 1446016}

while grouping the metadata by class (for all 2.3M cells) yields

Class
IT-ET Glut        3911482310
OPC-Oligo         1837739725
Astro-Epen        1147969857
NP-CT-L6b Glut     784204001
MB Glut            750316389
CNU-LGE GABA       493293499
MB GABA            448868487
CTX-MGE GABA       443671252
CB Glut            427319942
TH Glut            356853925
DG-IMN Glut        339200608
P Glut             264327471
CNU-HYa GABA       238441380
Vascular           237619853
CTX-CGE GABA       233809950
Immune             210318453
HY GABA            205488208
LSX GABA           194601248
MY GABA            162226871
HY Glut            161648292
CNU-HYa Glut       143951176
P GABA             126838072
CNU-MGE GABA       115760498
MY Glut            105435880
OB-IMN GABA         72832099
MB Dopa             54655953
CB GABA             50524434
MH-LH Glut          34059796
MB-HB Sero          26239541
OB-CR Glut           7518986
Pineal Glut          2123205
OEC                   851273
HY Gnrh1 Glut         785959
Name: # of Fragments, dtype: int64
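(For reference, both tallies above can be produced roughly like this; paths and variable names are placeholders, and the derived 'Class' column is assumed to have been added to the metadata as described above:)

import glob
import gzip
import os

import pandas as pd

# Line counts per class-level bed.gz file (equivalent to `zcat f | wc -l`).
bed_counts = {}
for path in sorted(glob.glob(os.path.join(class_level_fragments_dir, '*.bed.gz'))):
    with gzip.open(path, 'rt') as fh:
        bed_counts[os.path.basename(path)] = sum(1 for _ in fh)

# Per-class fragment totals from the cell-level metadata
# (Supplementary Table 2, with the derived 'Class' column added).
meta = pd.read_table(cell_level_metadata_txt_path)
frag_totals = (meta.groupby('Class')['# of Fragments']
                   .sum()
                   .sort_values(ascending=False))
print(frag_totals)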

Strangely, I notice that for the HY GABA class, the number of written lines (226103206) exceeds the number of fragments belonging to that class in the metadata (205488208) by ~20M entries, so I'm not sure what is going on there.

I'm also not sure whether this is related to the particularly large files, since the error occurred only after much of the data had already been written. I would appreciate any thoughts on this!

Best, Jay

beyondpie commented 9 months ago

@jayluo2 Hi Jay,

  1. May I know whether "CEMBA180129_3A_rm_dlt.h5ad" is the only file with the read error?
  2. Since at least one file failed to read, I wonder whether other files failed as well, which could explain the lower fragment counts.

Since the data is pretty large, I would suggest testing your script on a small group of cells first (see the sketch below), or saving the bed files at the subclass, L3, or L4 level and merging them later.
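For example, something along these lines (a sketch; the group name and output directory are placeholders, and I am assuming from export_bed's signature that selections restricts the export to the listed groups):

import snapatac2

# Test the export on a single small group first; `selections` should limit
# the export to the listed group names.
snapatac2.ex.export_bed(adataset,
                        groupby='Class',
                        selections=['OEC'],
                        suffix='.bed.gz',
                        out_dir=test_output_dir)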

Songpeng

beyondpie commented 9 months ago

@jayluo2 This is how I export bed files for peak calling: https://github.com/beyondpie/CEMBA_wmb_snATAC/blob/main/03.peakcalling/src/main/python/prepare_bedfiles.py I use a Snakefile to organize this so that multiple groups can run simultaneously.

jayluo2 commented 9 months ago
  1. I believe the run stopped at the first file that produced the error (since these are panics, not just warnings), so I'm not sure whether additional files would also fail. I've separately loaded "CEMBA180129_3A_rm_dlt.h5ad" using snapatac2.read() and the contents appear normal:
AnnData object with n_obs × n_vars = 7350 × 5451055
    obs: 'tsse', 'n_fragment', 'frac_dup', 'frac_mito', 'doublet_probability', 'doublet_score'
    var: 'count', 'selected'
    uns: 'reference_sequences', 'scrublet_sim_doublet_score'
    obsm: 'insertion'

anndata.X yields

<7350x5451055 sparse matrix of type '<class 'numpy.uint32'>'
    with 56624600 stored elements in Compressed Sparse Row format>

and the .obs slot looks fine as well.

  1. The HY GABA case I pointed out is strange because the # of written lines (in the bed.gz file) should be at most the # of fragments (of the HY GABA class) from the metadata (Supplementary Table 2), so it is odd that more fragments were written to disk (226103206 lines) than are present in the metadata (205488208 fragments).
beyondpie commented 9 months ago

@jayluo2

  1. I am not sure what happened with that file; it looks OK from what you show.
    • The fragments should be derived from obsm['insertion']. Would you mind checking whether it contains data? (See the sketch after this list.)
    • I suggest we first figure out what happened with this file. Would you mind testing the bed export on this file alone?
  2. @wkl1990 Kangli, do you have any comment on this? Could the discrepancies come from
    • filtering of blacklist regions?
    • or from ignoring fragments longer than 1,000 bp when counting them for Supplementary Table 2?
    • @jayluo2 did you see any inconsistencies for other classes?
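Something like this could scan all sample files for read failures (a sketch; I am assuming snapatac2.read can open the files in backed mode, and adatas_l is your (name, path) list):

import snapatac2

# Force a full read of obsm['insertion'] from each sample file to find
# files that fail with an I/O error.
for name, path in adatas_l:
    try:
        adata = snapatac2.read(path, backed='r')
        _ = adata.obsm['insertion']  # triggers the actual HDF5 read
        adata.close()
        print(f'{name}: OK')
    # PanicException derives from BaseException, so catch that here.
    except BaseException as err:
        print(f'{name}: FAILED ({err})')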
jayluo2 commented 9 months ago

@beyondpie .obsm['insertion'] on this h5ad file (CEMBA180129_3A_rm_dlt.h5ad) returns

<7350x2725521370 sparse matrix of type '<class 'numpy.uint8'>'
    with 95603440 stored elements in Compressed Sparse Row format>

Using export_bed() with a column of 'mock labels' (in alphabetical order) also works just fine, with the following command ("anndata" = the same h5ad file):

snapatac2.ex.export_bed(anndata,
                        groupby='mock_labels',
                        out_dir=output_dir,
                        suffix='.bed.gz')

The fragment files also appear to be fine, e.g.:

zcat a.bed.gz | head 
chr1    4186842 4186843 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1    4187029 4187030 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1    4187030 4187031 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1    4187283 4187284 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1    5070349 5070350 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1    5070467 5070468 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1    7138690 7138691 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1    7138914 7138915 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1    7378213 7378214 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA
chr1    7378408 7378409 AGCGATAGAACCAGGTGAAGTATGGGCTCTGA

and

zcat u.bed.gz | head 
chr1    3588059 3588060 TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1    3588103 3588104 TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1    5017916 5017917 TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1    5017979 5017980 TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1    7397876 7397877 TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1    7397997 7397998 TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1    22296964    22296965    TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1    22297030    22297031    TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1    24611574    24611575    TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
chr1    24611626    24611627    TCTCGCGCAAGAGGCAGCGTAAGAGTACTGAC
jayluo2 commented 9 months ago

@beyondpie Regarding your second question, I identified some more discrepancies similar to HY GABA above:

Discrepancy for HY_GABA: 205488208 fragments (metadata), but class-level bed.gz file has 226103206 rows
Discrepancy for CTX-CGE_GABA: 233809950 fragments (metadata), but class-level bed.gz file has 290152494 rows
Discrepancy for CTX-MGE_GABA: 443671252 fragments (metadata), but class-level bed.gz file has 550481310 rows
Discrepancy for MH-LH_Glut: 34059796 fragments (metadata), but class-level bed.gz file has 34759628 rows
Discrepancy for NP-CT-L6b_Glut: 784204001 fragments (metadata), but class-level bed.gz file has 996287794 rows
Discrepancy for Astro-Epen: 1147969857 fragments (metadata), but class-level bed.gz file has 1258567744 rows
Discrepancy for CNU-HYa_Glut: 143951176 fragments (metadata), but class-level bed.gz file has 166911254 rows
Discrepancy for MB_Glut: 750316389 fragments (metadata), but class-level bed.gz file has 815605023 rows
Discrepancy for HY_Glut: 161648292 fragments (metadata), but class-level bed.gz file has 177754869 rows
Discrepancy for TH_Glut: 356853925 fragments (metadata), but class-level bed.gz file has 455732012 rows
Discrepancy for P_GABA: 126838072 fragments (metadata), but class-level bed.gz file has 133969424 rows
Discrepancy for Immune: 210318453 fragments (metadata), but class-level bed.gz file has 234100458 rows
Discrepancy for Vascular: 237619853 fragments (metadata), but class-level bed.gz file has 262067892 rows
Discrepancy for CB_Glut: 427319942 fragments (metadata), but class-level bed.gz file has 476247064 rows
Discrepancy for OPC-Oligo: 1837739725 fragments (metadata), but class-level bed.gz file has 1912232980 rows
Discrepancy for IT-ET_Glut: 3911482310 fragments (metadata), but class-level bed.gz file has 4824967792 rows
Discrepancy for CNU-HYa_GABA: 238441380 fragments (metadata), but class-level bed.gz file has 285071090 rows
Discrepancy for CB_GABA: 50524434 fragments (metadata), but class-level bed.gz file has 57209872 rows
Discrepancy for MB_GABA: 448868487 fragments (metadata), but class-level bed.gz file has 572258267 rows

These file line counts are, of course, from the time of the error.

I agree with you and think the source of these discrepancies must lie in how the "# of Fragments" column in Supplementary Table 2 was generated. Do you have thoughts on which of the two (i.e., Supplementary Table 2 vs. the bed.gz line counts from export_bed()) is the more filtered/up-to-date version?
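(For reference, a sketch of how this comparison can be tallied, reusing bed_counts and frag_totals from the sketch in my first post:)

# Report classes where the bed.gz line count differs from the metadata total.
for cls, n_meta in frag_totals.items():
    fname = cls.replace(' ', '_') + '.bed.gz'
    n_bed = bed_counts.get(fname)
    if n_bed is not None and n_bed != n_meta:
        print(f'Discrepancy for {fname[:-7]}: {n_meta} fragments (metadata), '
              f'but class-level bed.gz file has {n_bed} rows')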

beyondpie commented 9 months ago

@jayluo2 Thank you so much for the feedback!

  1. So you can actually produce the bed files; then I am a little confused by your first comment about the error. If you find that the error no longer occurs, that would be great!
  2. I think the hdf5 files I shared are the ones we used, so you can use them for later analysis. Just remember that

I will keep this open to see whether @wkl1990 can give us feedback later. For now, I suggest you use the files you have and temporarily ignore the inconsistencies between your counts and the ones in Supplementary Table 2.

Sincerely, Songpeng

jayluo2 commented 9 months ago

@beyondpie Thank you for the details!

Best, Jay

beyondpie commented 9 months ago

@jayluo2

  1. Based on the script I linked, I include all fragments located on chr1-19, chrX, and chrY, without any blacklist filtering. I am not sure about the discrepancies you saw; I need to check with my colleague.
  2. Good.
  3. I haven't seen these failures before, but I usually store the data at the subclass or L4 level, so I am not sure whether the class-level output is simply too big to be written to a single file (or whether it is a memory or other issue). Would you mind testing with smaller groups?

Best, Songpeng

jayluo2 commented 9 months ago

@beyondpie Thank you for clarifying! I tried grouping by subclass with the same command,

snapatac2.ex.export_bed(adataset,
                        groupby='Subclass',
                        suffix='.bed.gz',
                        out_dir=output_dir)

but received the same error:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: H5Dread(): can't read data: file read failed: time = Wed Dec 20 16:47:56 2023
, filename = '/path/to/CEMBA181022_6B_rm_dlt.h5ad', file descriptor = 142, errno = 5, error message = 'Input/output error', buf = 0x62c1890f, total read size = 33416, bytes this sub-read = 33416, bytes actually read = 18446744073709551615, offset = 0', /root/.cargo/git/checkouts/anndata-rs-087569d311f7caf2/d54b207/anndata/src/container/base.rs:1160:85
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[40], line 5
      3 print(f'Writing outputs to f{subclass_level_fragments_dir}')
      4 print('='*50)
----> 5 snapatac2.ex.export_bed(adataset,
      6                        groupby='Subclass',
      7                        suffix='.bed.gz',
      8                        out_dir = subclass_level_fragments_dir
      9                        )

File ~/miniconda3/envs/conda_env/lib/python3.8/site-packages/snapatac2/export/__init__.py:55, in export_bed(adata, groupby, selections, ids, out_dir, prefix, suffix)
     52 elif isinstance(ids, str):
     53     ids = adata.obs[ids]
---> 55 return internal.export_bed(
     56     adata, list(ids), list(groupby), selections, out_dir, prefix, suffix,
     57 )

PanicException: called `Result::unwrap()` on an `Err` value: H5Dread(): can't read data: file read failed: time = Wed Dec 20 16:47:56 2023
, filename = '/path/to/CEMBA181022_6B_rm_dlt.h5ad', file descriptor = 142, errno = 5, error message = 'Input/output error', buf = 0x62c1890f, total read size = 33416, bytes this sub-read = 33416, bytes actually read = 18446744073709551615, offset = 0

I am not sure whether this is tied to any particular h5ad file. Would it be possible to upload the subclass-level fragment bed files to catlas.org?

Best, Jay

beyondpie commented 9 months ago

@jayluo2 Sorry to hear that you still have the errors.

  1. Would you mind sharing the whole script with me? Let me read your code first.
  2. I will check whether I can share the h5ad files with you through a Google Drive link (I will let you know before this Friday), and then you can try again. They should be identical to what you have, but let's double-check.
  3. @wkl1990 Would you mind uploading the subclass-level bed files? (If Catlas is not easy, we can use a Google Drive link and share that with @jayluo2 first.)

Songpeng

wkl1990 commented 9 months ago

@beyondpie Apologies for the delay in uploading bed files. I will upload them to Catlas and notify you once it's completed.

jayluo2 commented 9 months ago

@beyondpie @wkl1990 Thank you both for your help, and that all sounds great!

@beyondpie Below is the code:

import os

import numpy as np
import pandas as pd
import snapatac2

# paths
cell_level_metadata_txt_path = '/path/to/Supplementary_Table_2.txt'
subclass_class_mapping_metadata_txt_path = '/path/to/AIT21_annotation_freeze_081523.tsv'
raw_anndata_dir = '/path/to/raw_anndata_files'
anndataset_path = '/path/to/save/anndataset_object'
output_dir = '/path/to/output/fragments'

# Load metadata (all 2.3M cells)
whole_mouse_brain_2023_metadata_df = pd.read_table(cell_level_metadata_txt_path)

# Get (sample name, absolute path) pairs for the sample h5ad files
raw_h5ad_filenames_l = sorted(f for f in os.listdir(raw_anndata_dir)
                              if f.endswith('_rm_dlt.h5ad'))
adata_paths_l = [os.path.join(raw_anndata_dir, filename) for filename in raw_h5ad_filenames_l]
adatas_l = [(path.split('/')[-1].split('_rm_dlt.h5ad')[0], path) for path in adata_paths_l]

# Build the AnnDataSet object
adataset = snapatac2.AnnDataSet(
    adatas=adatas_l,
    filename=anndataset_path
)

# Ensure obs_names are unique;
# use the "." delimiter because that's what's used in the cell-level metadata
unique_cell_ids = [sa + '.' + bc for sa, bc in zip(adataset.obs['sample'], adataset.obs_names)]
adataset.obs_names = unique_cell_ids
assert adataset.n_obs == np.unique(adataset.obs_names).size

# Get the cell-to-subclass mapping
cellid_to_subclass_dict = dict(zip(whole_mouse_brain_2023_metadata_df['CellID'],
                                   whole_mouse_brain_2023_metadata_df['Subclass']))

# Map the 'Subclass' labels onto a Series built from adataset.obs_names
cell_barcodes_series = pd.Series(adataset.obs_names)
subclass_labels = cell_barcodes_series.map(cellid_to_subclass_dict)

# Replace all spaces with underscores in the subclass labels
subclass_labels = subclass_labels.str.replace(' ', '_', regex=False)
assert not subclass_labels.isna().any()

# Add the subclass labels as a new column to adataset.obs
adataset.obs['Subclass'] = subclass_labels

# Export subclass-level fragment BED files
snapatac2.ex.export_bed(adataset,
                        groupby='Subclass',
                        suffix='.bed.gz',
                        out_dir=output_dir)

Best, Jay

beyondpie commented 9 months ago

@jayluo2

  1. We also uploaded the hdf5 files (see the supplementary file GSE246791_RAW.tar): https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE246791. You can give those a try.

  2. I did not see apparent bugs in your code. Previously I found that I could not easily write some columns into a SnapATAC2 AnnData/AnnDataSet. The groupby parameter also accepts an independent list: https://github.com/beyondpie/CEMBA_wmb_snATAC/blob/8d090924b7a75ef6bf12f6a6ee396092d9437559/03.peakcalling/src/main/python/prepare_bedfiles.py#L83 This is how I export the bed files (see the sketch below). Also, your error is about reading a file, yet you can read that particular file correctly when you work on it alone. I don't know what happened; this looks strange to me. Could you close the hdf5 files (or restart your Python REPL) and try your code again?

  3. @wkl1990 Thank you so much! Did you have any errors with this task before? Or could you share your code with @jayluo2 so he can run yours?
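The list-based call would look roughly like this (a sketch; variable names are placeholders):

import snapatac2

# Pass the per-cell labels directly as a list instead of writing them into
# obs first; export_bed converts groupby with list() internally anyway.
labels = list(subclass_labels)  # one label per cell, in obs order
snapatac2.ex.export_bed(adataset,
                        groupby=labels,
                        suffix='.bed.gz',
                        out_dir=output_dir)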

Songpeng

wkl1990 commented 9 months ago

@beyondpie @jayluo2 The subclass-level fragment files have been uploaded to Catlas. You can download them here: http://catlas.org/wholemousebrain/#!/downloads

jayluo2 commented 9 months ago

@wkl1990 @beyondpie Thank you so much! I will proceed with those files for now and close this issue, but would appreciate any updates regarding the h5ad files as well. Thank you both for all your help!

jayluo2 commented 9 months ago

@beyondpie @wkl1990 I see that there are 338 bed.gz fragment files in the uploaded directory. Could you please clarify why there are 275 cCRE files (which appear to be subclass level) but 338 subclass fragment files? I am specifically referring to this part of the paper: "Finally, 275 out of 338 subclasses were annotated to the 1,482 subtypes. This includes 253 out of 315 neuronal subclasses, covering 28 neuronal classes and 7 neurotransmitter types, as well as 22 out of 23 non-neuronal subclasses, covering 5 non-neuronal classes."

My interpretation of this is that only 275 of the 338 scRNA-seq subclasses anchored properly to the snATAC-seq data. Is this correct? In that case, I presume the corresponding fragment files for these 275 subclasses are the most relevant files for downstream analyses?

Best, Jay

beyondpie commented 9 months ago

@jayluo2 Would you mind checking the actual number of files? Kangli just added a subclass_id prefix to each file, so a name like "338_blabla" only indicates the subclass index, not the number of files.
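For instance (a sketch; the directory path is a placeholder):

import glob
import os

# The numeric prefix in each filename is a subclass_id, not a running count;
# count the files themselves.
print(len(glob.glob(os.path.join(subclass_fragments_dir, '*.bed.gz'))))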

jayluo2 commented 9 months ago

@beyondpie I see 275 files as expected — thank you!