data - Githubissues

gouxiaojuan commented 8 months ago

Hello, may I ask what pipeline was used in this paper (https://www.nature.com/articles/s41586-023-06824-9#data-availability) to go from FASTQ files to fragment files and the other processed files? Is there a corresponding tutorial? I am asking because I need fragment files like those provided by 10x Genomics, but I did not find them in the data you provided. Alternatively, could you provide the fragment files for each sample? Thank you very much!

beyondpie commented 8 months ago

@gouxiaojuan Sorry, I only just saw your issue.

How do you get from FASTQ files to fragment and other files? i) We align the FASTQ files to mm10 with bwa to produce BAM files. ii) We then use SnapATAC2 to load the BAM files directly and generate the fragment files. You can read this: get fragment
Is there a tutorial for this? No, but I put all the code in the 00.data.preprocess directory, and you can look through the scripts there. The directory is not well organized, so feel free to let me know if you have further questions.
About the fragment files:
- The data were not generated with 10x. But you should be able to generate the fragment files by following my answer above. If you run into trouble, just let me know.
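
For readers looking for a starting point, step ii) above might look roughly like the sketch below. It assumes SnapATAC2 is installed, that the reads are paired-end, and that each read name starts with the cell barcode followed by a colon. The regex, the file names in the comment, and the helper names `fragment_kwargs` and `make_fragments` are hypothetical and are not taken from the repository; the shift values are believed to be the SnapATAC2 defaults.

```python
def fragment_kwargs(bam_file, output_file, barcode_regex=r"^([ACGT]+):",
                    shift_left=4, shift_right=-5):
    """Collect keyword arguments for sa2.pp.make_fragment_file().

    barcode_regex is an ASSUMPTION about the read-name layout; it must
    contain exactly one capturing group that matches the barcode.
    """
    return {
        "bam_file": bam_file,
        "output_file": output_file,
        "is_paired": True,            # assumption: paired-end reads
        "barcode_regex": barcode_regex,
        "shift_left": shift_left,     # insertion-site correction, left end
        "shift_right": shift_right,   # insertion-site correction, right end
    }


def make_fragments(bam_file, output_file, **overrides):
    """Write a fragments file from a BAM file (requires SnapATAC2)."""
    import snapatac2 as sa2  # imported lazily so the helper above works without it
    return sa2.pp.make_fragment_file(
        **fragment_kwargs(bam_file, output_file, **overrides)
    )

# Hypothetical usage:
# make_fragments("sample.bam", "sample.fragments.tsv.gz")
```

The scripts in the 00.data.preprocess directory remain the authoritative reference for the parameters that were actually used.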

Thanks! Songpeng

jayluo2 commented 7 months ago

Hi @beyondpie,

I am wondering whether any further filtering/processing of the fragments files or raw BAM files was performed in addition to the two QC steps:

Number of unique fragments >= 1,000 and
TSS enrichment >= 10

which removed, as stated in the paper, "7% of nuclei that were deemed to be potential doublets". I have generated class-level fragments files using sa2.pp.make_fragment_file() as in the script you posted above, and there seem to be more fragments per class in Supplementary Table 2 than in my class-level fragments files, which I built from the raw BAM files you uploaded earlier. Could you please clarify exactly how the "# of Fragments" column in Supplementary Table 2 was generated?
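
The two thresholds listed above can be written as a simple per-cell filter. This is only a sketch of the stated cutoffs: the `CellQC` record and its field names are hypothetical, and it does not include doublet removal or any other step the paper's pipeline may apply.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    """Per-cell QC summary (hypothetical record layout)."""
    barcode: str
    n_unique_fragments: int
    tss_enrichment: float


def passes_qc(cell, min_fragments=1000, min_tsse=10.0):
    """Apply the two cutoffs quoted in the thread: >= 1,000 fragments, TSSe >= 10."""
    return (cell.n_unique_fragments >= min_fragments
            and cell.tss_enrichment >= min_tsse)


# Made-up example cells, for illustration only.
cells = [
    CellQC("cellA", 2500, 14.2),
    CellQC("cellB", 800, 20.0),   # too few fragments
    CellQC("cellC", 5000, 6.5),   # TSS enrichment too low
]
kept = [c.barcode for c in cells if passes_qc(c)]
print(kept)
```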

I have a few additional questions:

I also noticed that there is a “bam2bedpe” functionality here. Is this how you would recommend generating bedpe files from raw BAM files?
I noticed that for one cell (CEMBA200827_7H.ATGGTTTGGGCGCGACTTGAGA; and possibly others), the fragment end coordinate is smaller than the fragment start coordinate (see the line below). I am wondering if this is due to the internal processing of sa2.pp.make_fragment_file()?
chrX 138383596 107973759 CEMBA200827_7H.ATGGTTTGGGCGCGACTTGAGA 5
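
One way to find every such record is to scan the fragments file for lines whose end is smaller than the start. The sketch below assumes the usual five-column layout (chromosome, start, end, barcode, count); the function names are hypothetical, and the second example line is made up for contrast.

```python
def parse_fragment(line):
    """Parse one fragments line into (chrom, start, end, barcode, count)."""
    chrom, start, end, barcode, count = line.split()[:5]
    return chrom, int(start), int(end), barcode, int(count)


def find_inverted(lines):
    """Return parsed records whose end coordinate is smaller than the start."""
    inverted = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        record = parse_fragment(line)
        if record[2] < record[1]:
            inverted.append(record)
    return inverted


example = [
    # Line reported in this thread:
    "chrX 138383596 107973759 CEMBA200827_7H.ATGGTTTGGGCGCGACTTGAGA 5",
    # Made-up well-formed line:
    "chr1 3000100 3000250 CEMBA200827_7H.ATGGTTTGGGCGCGACTTGAGA 1",
]
print(find_inverted(example))
```

For a real file, the same function can be fed the lines of an opened (for example gzip-decompressed, text-mode) fragments file.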

Best, Jay

jayluo2 commented 7 months ago

@beyondpie When counting the number of occurrences of barcodes in the corresponding sample fragments file, I encounter off-by-one discrepancies:

Sanity check failed: CEMBA201210_10D.TGGTGCGCATGTACAACTCTAG. Metadata says 25209 but .tsv file says 25208

Sanity check failed: CEMBA181023_6B.AAGCAAAGTCACTCTTCCTCAT. Metadata says 6311 but .tsv file says 6312
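
A check of this kind can be reproduced with a short script like the sketch below. It counts lines per barcode in a fragments file and reports barcodes whose count differs from a metadata value. The function names and toy data are hypothetical, and the thread does not say whether the metadata counts lines or sums the fifth column, so line counting here is an assumption.

```python
from collections import Counter


def count_barcode_lines(lines):
    """Count how many fragment lines each barcode (4th column) has."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split()
        counts[fields[3]] += 1
    return counts


def mismatches(counts, metadata):
    """Map barcode -> (metadata value, observed value) where the two differ."""
    return {
        barcode: (expected, counts.get(barcode, 0))
        for barcode, expected in metadata.items()
        if counts.get(barcode, 0) != expected
    }


# Toy data, for illustration only.
toy_lines = [
    "chr1\t10\t20\tS.AAA\t1",
    "chr1\t30\t40\tS.AAA\t2",
    "chr2\t5\t9\tS.CCC\t1",
]
toy_metadata = {"S.AAA": 3, "S.CCC": 1}
print(mismatches(count_barcode_lines(toy_lines), toy_metadata))
```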

Thanks, Jay

beyondpie commented 7 months ago

@jayluo2

I also noticed that there is a “bam2bedpe” functionality here. Is this how you would recommend generating bedpe files from raw BAM files?

My colleague previously generated the bedpe files using the code here: https://github.com/beyondpie/CEMBA_wmb_snATAC/blob/543bf5c73a6f34638bfcdff8fab9400d391598ae/00.data.preprocess/src/main/pipeline/alignment.Snakefile#L255 I haven't run this part myself, but if you still need the bedpe files, I would suggest following that code. If you run into problems, let's discuss then.

Sincerely, Songpeng

jayluo2 commented 7 months ago

@beyondpie Either fragments files (but shifted slightly differently, as we discussed) or bedpe files will work for me. Since sa2.pp.make_fragment_file() already has functionality to change fragment start/end shifting, perhaps we can stick with fragments files for now?
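
If only the shift differs, an existing fragments file can in principle be re-shifted arithmetically instead of being regenerated from BAM. The sketch below undoes one (left, right) shift and applies another. The old shift of (4, -5) is believed to match the SnapATAC2 defaults for `shift_left` and `shift_right`; the new shift of (4, -4) is purely illustrative, since the thread does not state which convention is wanted, and the function name `reshift` is hypothetical.

```python
def reshift(start, end, old=(4, -5), new=(4, -4)):
    """Convert fragment coordinates from one (left, right) shift to another.

    Assumes shifts were applied additively: start + left, end + right.
    """
    raw_start = start - old[0]  # undo the old left shift
    raw_end = end - old[1]      # undo the old right shift
    return raw_start + new[0], raw_end + new[1]


# A fragment stored as (100, 200) under the assumed (4, -5) shift:
print(reshift(100, 200))              # re-shifted to the illustrative (4, -4)
print(reshift(100, 200, new=(0, 0)))  # unshifted coordinates
```

Regenerating the files with the desired `shift_left` and `shift_right` values in sa2.pp.make_fragment_file() avoids relying on this assumption about how the shift was applied.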

Best, Jay

beyondpie commented 6 months ago

@jayluo2 I am closing this issue now. If you have further questions, just let me know. Thanks! Songpeng

beyondpie / CEMBA_wmb_snATAC

data #10