markphillippebworth opened 5 months ago
Hello! I've been trying to use your software, but I only ever see global CNV changes, with no hint of single-cell changes.
I've tried changing the window size between 1e5, 5e5, and 1e6, the minFrags between 1000 and 1500, and the minsizeCNV between 0 and 4, and yet I still get essentially identical results regardless. The results don't really make sense to me. The data also comes from 10x, and I've been working with it for a while, so there are no significant issues with it.
Code:
epiAneufinder(input = pathToArrow, # Enter path to your fragments.tsv file or the folder containing bam files
              outdir = "epiAneufinder_results", # Path to the directory where results should be written
              blacklist = "hg38-blacklist.v2.bed.gz", # Path to bed file that contains the blacklisted regions of your genome
Output:
[1] "Removing old file from the output folder"
Subtracting Blacklist...
Adding Nucleotide Information...
1 of 22 ... 22 of 22
[1] "Finished making windows successfully"
[1] "Obtaining the fragments tsv file"
GRanges object with 6 ranges and 2 metadata columns:
      seqnames    ranges strand | barcode pcr
Hi,
the karyogram you have is of only one cell. The software does not produce global CNVs; it produces CNVs per cell. Also, in the messages that the software prints you can see that the count matrix that was calculated includes only one cell, hence the plot:
[1] "Count matrix with 1 cells and 4900 windows has been generated and will be saved as count_summary.rds"
Are you using a fragments file or a bam file as input? If it is a bam file, then it should be one per cell. If you are using a fragments file, there might be a formatting issue. Another possibility is that the minimum fragments variable is still too high for the dataset you have, although if you have set it to 1000 I find that unlikely.
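As a quick sanity check (a minimal sketch; the file name is a placeholder, and reading .gz with fread requires the R.utils package), you can peek at the first few records to confirm the barcodes really sit in the fourth column:
library(data.table)
# A standard 10x fragments file is a headerless, tab-separated table:
# chrom, start, end, barcode, readCount (barcodes in column 4).
# If the file starts with '#' comment lines, add skip = "chr" to jump past them.
frags <- fread("fragments.tsv.gz", nrows = 5, header = FALSE)
print(frags)
# The fourth column should hold barcode strings, not numbers.
stopifnot(is.character(frags$V4))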
Best, Katia
Thank you for the quick response! You're right - that is one cell.
I was using a fragments.tsv.gz file, and I don't know why, but epiAneufinder wasn't recognizing the cell ID column in it. When I manually subset the fragment file to match the exact column layout from the tutorial, it worked. I think it was looking for the cell barcode column to be the 4th column, and it wasn't, so it was treating all the cells as one big cell. Is there a way to tell it which column contains the cell barcodes, or something like that? Otherwise, I'll just need to manually edit each file.
Thank you again! It's neat to see it working.
Hi,
I am glad that it worked out. The way that epiAneufinder is designed, as you mentioned, is to parse the fragment file by column position. In a standard 10x fragments file the fourth column holds the barcodes, and we designed the algorithm around that format specification so it is compatible with the different data types the software accepts. I am not sure what your dataset looks like, but if it is a non-standard fragment format then unfortunately you will have to convert it first. Is your dataset coming from a different vendor that has a different format specification?
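If you do need to convert, something along these lines should work (a minimal data.table sketch; the file names and the assumed input column order are placeholders for whatever your pipeline produces):
library(data.table) # fwrite gzips the output automatically based on the .gz extension
# Hypothetical input layout: chrom, start, end, readCount, barcode.
frags <- fread("fragments_nonstandard.tsv.gz", header = FALSE,
               col.names = c("chrom", "start", "end", "readCount", "barcode"))
# Rewrite in the standard 10x order, with the barcode in column 4.
fwrite(frags[, .(chrom, start, end, barcode, readCount)],
       "fragments_fixed.tsv.gz", sep = "\t", col.names = FALSE)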
Best, Katia
We have our own scATAC-seq processing pipeline, which builds on 10x output with additional steps. That being said, I've never had an issue with ArchR. It's possible that it relies on column names, instead of column positions, when parsing these files.
I am, however, having issues with the program getting stuck on some files. It'll run overnight, stuck at the GC bias correction. I'm not sure why.
Hi,
the GC bias correction is probably the most computationally intensive step. For me it might also take half a day, depending on the dataset and the resources I give it. Do you see an error, or is it just taking too long? You can try with more cores.
Best, Katia
Ah ok - I was seeing that it was taking forever, more than half a day, on 10 cores. I'll try upping the resources and the window size, so it has less to iterate over.
I tried running it over the weekend with 45 cores - and it's still stuck on GC correction. I've checked CPU usage, and it's been using all 45 cores for the last two days without even completing one file.
epiAneufinder(input = paste('FragmentFiles/', gsub(".*ArrowFiles/|.arrow", "", frag), '.tsv', sep = ''),
              outdir = paste("Results/", gsub(".*ArrowFiles/|.arrow", "", frag), sep = ''), # Path to the directory where results should be written
              blacklist = "hg38-blacklist.v2.bed.gz", # Path to bed file that contains the blacklisted regions of your genome
              windowSize = 1e5,
              genome = "BSgenome.Hsapiens.UCSC.hg38", # Substitute with relevant BSgenome
              exclude = c('chrX', 'chrY', 'chrM'),
              reuse.existing = FALSE,
              title_karyo = "Karyogram of sample data",
              ncores = 45,
              minFrags = 1000,
              minsizeCNV = 1,
              k = 1,
              plotKaryo = TRUE)
Output:
Counting reads from fragments/bed file ..
[1] "Count matrix with 14940 cells and 26088 windows has been generated and will be saved as count_summary.rds"
I understand; I am also running some samples and I am at day number 4. We are planning to make the GC correction step optional, but we first need to check how the results are going to be affected before recommending that. It will take a bit of time, but I'll keep you posted.
Best, Katia
That would be great, thank you! I tried to rewrite part of it from mclapply to parLapply, to help eliminate some memory-leak issues (and runtime) when heavily parallelizing, but that hasn't helped, unfortunately.
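For reference, this is the general shape of the change I tried (an illustrative sketch only, not your actual internals; correct_gc and per_cell_data are stand-in names):
library(parallel)
# Before, fork-based:
#   res <- mclapply(per_cell_data, correct_gc, mc.cores = ncores)
# After, PSOCK cluster:
cl <- makeCluster(ncores)                        # spawn worker processes
clusterExport(cl, varlist = "correct_gc")        # ship the function to the workers
res <- parLapply(cl, per_cell_data, correct_gc)  # same semantics as lapply
stopCluster(cl)                                  # always release the workers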
Thank you for that. I think it could make sense to skip it, depending on the variance in GC content across bins. If the bin size is large enough, I would imagine the GC content variance would be minimal.
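One way to check that intuition (a minimal sketch with BSgenome, restricted to chr1 for speed; binsize stands in for whatever windowSize you run with):
library(BSgenome.Hsapiens.UCSC.hg38) # attaches GenomicRanges and Biostrings
# Tile chr1 at the chosen bin size and look at the spread of GC content.
binsize <- 1e5
windows <- tileGenome(seqlengths(BSgenome.Hsapiens.UCSC.hg38)["chr1"],
                      tilewidth = binsize, cut.last.tile.in.chrom = TRUE)
gc <- letterFrequency(getSeq(BSgenome.Hsapiens.UCSC.hg38, windows),
                      letters = "GC", as.prob = TRUE)
summary(as.vector(gc)) # a tiny spread would support skipping the correction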
One alternative is switching from loess to lowess for the correction, which is known to be much faster over larger datasets (tens of thousands of points). Have you considered that implementation? I may try it out on my end and see if it speeds things up. I have close to 100 samples I'd like to process, and I can't wait 4 days per sample. https://stats.stackexchange.com/questions/161069/difference-between-loess-and-lowess#:~:text=lowess%20and%20loess%20count%20iterations,i.e.%2C%20iterations%3Diter%2B1
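To make the comparison concrete (a synthetic sketch; the data and the span/f values are made up, not what the package uses):
# Synthetic per-window GC fractions and counts.
n <- 50000
gc <- runif(n, 0.3, 0.7)
y <- 100 * gc + rnorm(n, sd = 5)
system.time(fit_loess <- loess(y ~ gc, span = 0.3)) # formula interface, slow at this n
system.time(fit_lowess <- lowess(gc, y, f = 0.3))   # old C implementation, much faster
# lowess returns x/y pairs sorted by x; interpolate back onto the
# original windows to use the result as a correction trend.
trend <- approx(fit_lowess$x, fit_lowess$y, xout = gc)$y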
Thank you for the recommendation. We can have a look and keep it as a possible option for cases with a large number of samples. Of course we will again need to test it, both in terms of resources and in terms of overall performance.
Best, Katia
Is there a reason you did a loess fit for each cell individually when correcting for GC bias? Theoretically, all the cells would have been processed and sequenced together. You could just collapse all the cells into one pseudobulk sample, run your loess fit to generate your GC correction score once, and then apply that to all your cells evenly.
It's not like you're going to have cell-specific GC bias when sequencing.
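In code, the idea looks roughly like this (a sketch only; counts as a windows-by-cells matrix and gc as the per-window GC fraction are assumed inputs, and lowess stands in for your loess fit):
pseudobulk <- rowSums(counts)              # collapse all cells per window
fit <- lowess(gc, pseudobulk, f = 0.3)     # one fit instead of one per cell
trend <- approx(fit$x, fit$y, xout = gc)$y # expected coverage given GC alone
# Apply the same per-window correction factor to every cell.
corrected <- counts * (mean(trend) / trend)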
We are currently trying out different things in order to boost the performance. We have already tried skipping the GC correction, but the results compared to the ground truth were far worse than what we would accept. We have some other things in mind to try. Thank you once again for your recommendation; we will add it to our list.
Sounds good! Please keep us abreast of what you figure out. I am curious what you considered to be unacceptable deviations from ground truth.
I'm experimenting with doing one GC correction across all cells within a given sample. I've branched your code and made the edits needed for it to run: https://github.com/aifimmunology/epiAneufinder/tree/optimized
If you want to test it out quickly, you can install from this branch and it should be ready to run. I just don't have the ability to easily benchmark it, but it runs much, much faster.
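Installing from the branch should be a one-liner (assuming you have the remotes package):
# Install epiAneufinder from the optimized branch of the fork.
remotes::install_github("aifimmunology/epiAneufinder", ref = "optimized")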
Thanks for your suggestion of a pseudobulk GC correction factor; this sounds like a good alternative. I will test it and let you know how well it performs on our test datasets! Regarding your question about performance differences without GC correction: for the SNU601 dataset the correlation to the ground truth decreased from 86% to 46%. We don't have a specific threshold for what we would consider "an acceptable deviation", but this was clearly too poor.
Yes, I absolutely agree with you that 46% is an 'unacceptable' deviation. Good call, and thank you for sharing!
Right now, I'm testing out the method on PBMCs, but I don't have any ground truth for comparison. It looks pretty good, though? I'm expecting some alterations, but not too many.
Whenever you can calculate accuracy metrics with pseudobulk GC correction, I would love to hear about it. I'd like to run your tool across several hundred scATAC-seq samples and analyze the results.
Here's an example run with pseudobulk GC correction:
epiAneufinder(input = frag, # Enter path to your fragments.tsv file or the folder containing bam files
              outdir = paste("Results/", gsub("FragmentFiles/|.tsv", "", frag), sep = ''),
              blacklist = "hg38-blacklist.v2.bed.gz",
              windowSize = 5e5,
              genome = "BSgenome.Hsapiens.UCSC.hg38",
              exclude = c('chrX', 'chrY', 'chrM'),
              title_karyo = "Karyogram of sample data",
              ncores = 10,
              minFrags = 1000,
              minsizeCNV = 1,
              k = 1,
              plotKaryo = TRUE)