beyondpie / CEMBA_wmb_snATAC

Whole mouse brain snATAC seq analysis
MIT License
23 stars 3 forks source link

provide bedpe files like 2021 format. #9

Closed beyondpie closed 4 months ago

jayluo2 commented 8 months ago

Hi @beyondpie, thank you for providing the BAM files! I would like to confirm two items:

Best, Jay

catlas2023_bam_md5sums.txt

beyondpie commented 8 months ago

Hi @jayluo2

jayluo2 commented 8 months ago

Hi @beyondpie, thanks again for the BAM files! I am also wondering if raw fragments.tsv.gz files (for each sample) can be provided, as this would facilitate regrouping of sample fragments into their corresponding cell types?

Update: I will try make_fragment_file() from SnapATAC2 first.

beyondpie commented 8 months ago

@jayluo2 Sure. I will give it a look! I recently have some personal things to handle, will reply to you next week (please feel free to let me know if you need other data, we can also have zoom meeting if needed). Thanks! Songpeng

jayluo2 commented 8 months ago

Hi @beyondpie, I am using the following command to generate fragments.tsv files, as in this script

stat = snapatac2.pp.make_fragment_file(bam_path,
                                    fragments_filename,
                                    is_paired = True,
                                    barcode_regex = "^(\w+):.+",
                                    shift_left = 4,
                                    shift_right = -4,
                                    min_mapq = 30
                                   )

When this command is run sequentially (per bam file), some fragments files are successfully generated with complete stats files, but ultimately produced this error

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 5, kind: Uncategorized, message: "Input/output error" }', /root/.cargo/git/checkouts/snapatac2-98bd0f054bf149f3/7154268/snapatac2-core/src/preprocessing/fragment.rs:141:43
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[10], line 2
      1 for ind, bam_path in enumerate(bam_filenames_absolute_path_l):
----> 2     make_fragments_file_all(bam_path)

Cell In[9], line 19, in make_fragments_file_all(input_path)
     17 t1 = time.time()
     18 input_list = [bam_file_absolute_path_i, fragments_filename_i]
---> 19 stat = make_fragments_file_i(input_list)
     21 with open(fragments_stats_file, 'w') as f:
     22     f.writelines(str(stat))

Cell In[8], line 5, in make_fragments_file_i(input_list)
      2 bam_absolute_path_i = input_list[0]
      3 fragments_filename = input_list[1]
----> 5 stat = snapatac2.pp.make_fragment_file(bam_absolute_path_i,
      6                                 fragments_filename,
      7                                 is_paired = True,
      8                                 barcode_regex = "^(\w+):.+",
      9                                 shift_left = 4,
     10                                 shift_right = -4,
     11                                 min_mapq = 30
     12                                )
     13 return stat

File ~/miniconda3/envs/conda_env/lib/python3.8/site-packages/snapatac2/preprocessing/_basic.py:88, in make_fragment_file(bam_file, output_file, is_paired, barcode_tag, barcode_regex, umi_tag, umi_regex, shift_left, shift_right, min_mapq, chunk_size)
     85 if barcode_tag is not None and barcode_regex is not None:
     86     raise ValueError("Only one of barcode_tag or barcode_regex can be set.")
---> 88 return internal.make_fragment_file(
     89     bam_file, output_file, is_paired, shift_left, shift_right, chunk_size,
     90     barcode_tag, barcode_regex, umi_tag, umi_regex, min_mapq,
     91 )

PanicException: called `Result::unwrap()` on an `Err` value: Os { code: 5, kind: Uncategorized, message: "Input/output error" }

When the command is run in parallel (using joblib), I receive errors that appear to differ between sample bam files. For example, for CEMBA171206_3C, I get

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Io(Os { code: 28, kind: StorageFull, message: "No space left on device" })', /root/.cargo/git/checkouts/snapatac2-98bd0f054bf149f3/7154268/snapatac2-core/src/preprocessing/mark_duplicates.rs:164:47
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "bam_to_fragments.py", line 55, in <module>
    make_fragments_file_all(bam_path)
  File "bam_to_fragments.py", line 38, in make_fragments_file_all
    stat = snapatac2.pp.make_fragment_file(bam_file_absolute_path_i,
  File "/home/groups/tttt/xjluo/miniconda3/envs/single_cell/lib/python3.8/site-packages/snapatac2/preprocessing/_basic.py", line 88, in make_fragment_file
    return internal.make_fragment_file(
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Io(Os { code: 28, kind: StorageFull, message: "No space left on device" })

(the “No space left on device” is unexpected, since I am writing to a large storage repository)

but for CEMBA171219_4D I have

thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', /root/.cargo/git/checkouts/snapatac2-98bd0f054bf149f3/7154268/snapatac2-core/src/preprocessing/mark_duplicates.rs:426:43
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "bam_to_fragments.py", line 55, in <module>
    make_fragments_file_all(bam_path)
  File "bam_to_fragments.py", line 38, in make_fragments_file_all
    stat = snapatac2.pp.make_fragment_file(bam_file_absolute_path_i,
  File "/home/groups/tttt/xjluo/miniconda3/envs/single_cell/lib/python3.8/site-packages/snapatac2/preprocessing/_basic.py", line 88, in make_fragment_file
    return internal.make_fragment_file(
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

I’m not sure if these errors are specific to particular samples. I also noticed that running this code in a Git repository appeared to generated many gigabytes (> 400G) of content (I did not check the specific files/source of this) in the folder from which the script is run (not the folder to which fragments files are written). It’d be great to get feedback on this!

Alternatively, it’d be great to have uploaded fragments.tsv (or tsv.gz with corresponding tabix index files) files, as these errors may also be due to differences between our software environments.

Best, Jay

beyondpie commented 8 months ago

@jayluo2 Hi Jay, thank you so much for the information! I just finished some personal things, and went back to lab in this week. Will keep you updated in this week. Songpeng

beyondpie commented 8 months ago

Hi @beyondpie, thank you for providing the BAM files! I would like to confirm two items:

  • Are the BAM files deduplicated?
  • Could you please verify the checksums (md5sum) of the BAM files (attached)?

Best, Jay

catlas2023_bam_md5sums.txt

@jayluo2 I confirmed that the md5sums are same. I will remove the bams now, and check the question you. Thanks! Songpeng

jayluo2 commented 8 months ago

Thank you for confirming!

Best, Jay

beyondpie commented 7 months ago

@jayluo2

Songpeng

beyondpie commented 4 months ago

@jayluo2 should I close the issue? Songpeng

jayluo2 commented 4 months ago

Yes, thank you!