LuyiTian / scPipe

a pipeline for single cell RNA-seq data analysis

Running multiple instances of scPipe #115

Closed bhavnah closed 5 years ago

bhavnah commented 5 years ago

Hi,

I would like to run multiple instances of scPipe on one sample with multiple libraries. I am using one of the HCA datasets (Ischaemic Sensitivity of Human Tissue; I believe you used this data as an example in your paper?) as a test. With this dataset I will end up with 4 gene count tables. Should I then merge the tables into a single one that will still be able to discriminate between the different libraries of the same sample? Or should I instead sum the gene counts based on cell barcode information?

I will be processing a much larger dataset in the near future, and I would like to submit array jobs on a computing cluster, so being able to run multiple instances of scPipe will greatly speed things up.

Thanks,

B.

LuyiTian commented 5 years ago

If you mean different lanes of an Illumina flow cell, then it is not recommended to process the FASTQ files separately. They contain the same cells, so the per-lane gene count tables will contain repeated UMIs, and simply adding them up will overcount the UMIs. scPipe should work for a single 10X sequencing run, which contains about 5,000 cells and 400M reads in one full NextSeq run. Of course, if you ran 8 samples in one 10X library prep, then you can run 8 scPipe jobs separately, since they are different cells.
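To illustrate the overcounting described above, here is a minimal toy sketch. The reads, barcodes, and UMI labels are made up for illustration, and the counting is a plain "distinct UMIs per cell and gene" tally; it does not reproduce scPipe's own UMI handling (for example, any mismatch correction).

```python
from collections import defaultdict

# Hypothetical toy reads: (lane, cell_barcode, gene, umi).
reads = [
    ("lane1", "ACGT", "GeneA", "UMI1"),
    ("lane1", "ACGT", "GeneA", "UMI2"),
    ("lane2", "ACGT", "GeneA", "UMI1"),  # same molecule, sequenced again in lane 2
    ("lane2", "ACGT", "GeneA", "UMI3"),
]


def umi_counts(records):
    """Count distinct UMIs per (cell, gene)."""
    seen = defaultdict(set)
    for _lane, cell, gene, umi in records:
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


# Approach 1: count each lane on its own, then add the tables together.
summed = defaultdict(int)
for lane in sorted({r[0] for r in reads}):
    lane_reads = [r for r in reads if r[0] == lane]
    for key, n in umi_counts(lane_reads).items():
        summed[key] += n

# Approach 2: pool the reads from all lanes first, then count once.
merged = umi_counts(reads)

print(dict(summed))  # {('ACGT', 'GeneA'): 4}
print(merged)        # {('ACGT', 'GeneA'): 3}
```

UMI1 is counted once per lane in the first approach, so the summed table reports 4 molecules where only 3 distinct UMIs exist.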

If you really want to speed things up, and you have enough RAM and disk space, you can do the following:

  1. Split the FASTQ files, then run sc_trim_barcode and the alignment separately on each one.
  2. Merge all the trimmed FASTQ files and all the BAM files.
  3. Run sc_detect_bc on the merged FASTQ to detect the cell barcodes. Say you get 8000 barcodes.
  4. Split the cell barcode annotation CSV into, for example, 8 CSVs that each contain 1000 barcodes.
  5. Run sc_count_aligned_bam with the merged BAM once for each CSV of 1000 cell barcodes.
  6. Merge the resulting gene count matrices.
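The CSV-splitting step above can be sketched as follows. This is a helper that lives outside scPipe and is not part of its API; the function name, the output file naming, and the assumption that the barcode annotation file is a plain CSV with a single header row are all illustrative.

```python
import csv
import os


def split_barcode_csv(in_csv, out_dir, n_chunks):
    """Split a CSV with one header row into up to n_chunks files.

    The header is repeated in every output file. Returns the output paths.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    with open(in_csv, newline="") as fh:
        rows = list(csv.reader(fh))
    header, data = rows[0], rows[1:]
    os.makedirs(out_dir, exist_ok=True)
    chunk_size = -(-len(data) // n_chunks)  # ceiling division
    paths = []
    for i in range(n_chunks):
        chunk = data[i * chunk_size:(i + 1) * chunk_size]
        if not chunk:
            break  # fewer rows than chunks: stop once the data runs out
        # Hypothetical naming scheme for the per-chunk annotation files.
        path = os.path.join(out_dir, f"barcode_anno_part{i + 1}.csv")
        with open(path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(chunk)
        paths.append(path)
    return paths
```

With 8000 barcodes and `n_chunks=8`, this would write 8 files of 1000 barcodes each, and each file could then be passed to its own counting job against the merged BAM.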

In my opinion you don't need to make your life harder like this. scPipe should be fine with most single 10X runs, unless you have 500,000 cells. That many cells are unlikely to come from a single 10X library prep, and you can always run separate preps individually.

bhavnah commented 5 years ago

Thanks Luyi. So if I understand correctly, I should process one sample as a whole, and I can do this in 2 ways:

  1. if this sample was sequenced over multiple channels and multiple lanes, I can concatenate all the reads and then process them together?
  2. Or I can run sc_trim_barcode and Rsubread::align on each fastq file of the sample, merge the combined.fastq files and BAM files prior to running sc_detect_bc, split the barcode_anno.csv file before running sc_count_aligned_bam and merge the individual gene matrices?

I am looking at the other HCA dataset (Census of Immune Cells); one of the datasets (cord blood) was prepared as follows:

If I were to process this dataset, I can run 8 instances of scPipe in parallel, and for each instance I could merge the reads prior to processing as explained in 1.? I can also run each instance as described in 2.?

Thanks!

B.

LuyiTian commented 5 years ago

Yes, you are right. But for each instance, if there are just 4000 cells, I don't think you need to use option 2 to speed things up. These steps are I/O intensive, so I don't think you will get much of a reduction in running time.

bhavnah commented 5 years ago

So for the HCA data, there are 8 samples, with 10 channels each. Each channel has 4000 cells. Does that mean I can potentially run 10 instances for one sample, and merge the gene matrices at the end?

LuyiTian commented 5 years ago

Hmm, I did not read the HCA documentation for this data. If each channel has 4000 cells and the cells are different in each channel, then you should not merge them. If they are the same cells, just sequenced multiple times to get more reads, then you should merge them.

bhavnah commented 5 years ago

I think the cells are different as it says 'For each donor we prepared 8 independent 10X channels'. But I am not sure either. The information is here: https://s3.amazonaws.com/preview-ica-expression-data/Brief+ICA+Read+Me.pdf

Would you mind telling me how I should go about processing this data? Just the cord blood data, for example. I am going to process a similar dataset using scPipe soon, so I want to make sure I am doing the right thing...

Thanks!!

LuyiTian commented 5 years ago

I think you should just run scPipe on each channel. Please specify a different prefix in sc_detect_bc for each channel so that the results can be merged without duplicated cell names. Also, don't forget to specify a different output folder for each channel: some output file names are the same, so files will be overwritten if you specify the same folder.
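As a rough sketch of why the per-channel prefix matters at merge time, the snippet below merges small in-memory count tables and refuses to continue if two channels share a cell name. The channel names, prefixes, folder layout, and the `merge_count_tables` helper are all hypothetical; nothing here calls scPipe itself.

```python
def merge_count_tables(tables):
    """Merge {cell_name: {gene: count}} tables from different channels.

    Raises ValueError if a cell name appears in more than one table,
    which is what distinct per-channel prefixes are meant to prevent.
    """
    merged = {}
    for table in tables:
        for cell, counts in table.items():
            if cell in merged:
                raise ValueError(f"duplicated cell name: {cell}")
            merged[cell] = dict(counts)
    genes = sorted({g for counts in merged.values() for g in counts})
    # Genes not detected in a channel are filled with zero counts.
    return {
        cell: {g: counts.get(g, 0) for g in genes}
        for cell, counts in merged.items()
    }


# Hypothetical layout: one cell-name prefix and one output folder per channel.
channels = {
    f"channel{i}": {"prefix": f"CH{i}_", "outdir": f"scpipe_out/channel{i}"}
    for i in (1, 2)
}

# Toy count tables, as if produced by two separate per-channel runs.
ch1 = {"CH1_001": {"GeneA": 3, "GeneB": 1}}
ch2 = {"CH2_001": {"GeneA": 2, "GeneC": 5}}
merged = merge_count_tables([ch1, ch2])
```

If both channels had used the same prefix, their first cells would share a name and the merge would raise instead of silently combining two different cells.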

bhavnah commented 5 years ago

Thank you. I will give it a try and see how I go.