Hoohm / dropSeqPipe

A SingleCell RNASeq pre-processing snakemake workflow
Creative Commons Attribution Share Alike 4.0 International
147 stars 47 forks source link

Empty summary files #20

Closed pwl closed 6 years ago

pwl commented 6 years ago

I run the whole pipeline (meta qc filter map extract) on one sample with 4M reads. Most of the pipeline run without any issues and then an R script threw an error near the end of the analysis. When I started debugging the scripts I found that all of the summary files are empty (apart from their headers).

You can find the reports here. What's surprising is that the results from qc look reasonable. Also, I'm no expert on bam files but they all seem pretty reasonable to me (I added the contents of the first 100 lines of the intermediate bam files converted to plain text with samtools view into the bamheads directory). You can find the full logs from running the pipeline in stderr.log and stdout.log. Perhaps I missed some option or specified a wrong threshold?

I used a standard pair of gtf/fasta files downloaded straight from GENCODE.

pwl commented 6 years ago

Perhaps related: the refFlat file generated by running the meta target turned out to be empty, so I updated my generate_meta.smk to fix that. Perhaps that's the source of the issue?

Hoohm commented 6 years ago

Yes, if you look at the sample1_final.txt in the headers, all your reads are tagged as XF:Z:INTERGENIC. Those are not counted as mapped to genes.

I'm pretty sure this is related to the modification of the refFlat file. Which genome and annotation files are you using?

pwl commented 6 years ago

I'm using the first gtf file from GENCODE, the "Comprehensive gene annotation" one for "CHR" regions.

EDIT: and for the genome I'm using the "Transcript sequences" from the same page.

Hoohm commented 6 years ago

Hi @pwl I think your issue comes from the transcripts.fa You should use Genome sequence (GRCh38.p10). I'm actually puzzled as to how it didn't throw an error (mine did) when using the transcripts.fa. Using the genome sequence, the refflat is not empty. Can you try again?

pwl commented 6 years ago

I'll try it, thanks. I also just realized that there are reference files (including .fa and .gtf files) on the drop-seq webpage in the "Data resources" section, I just haven't found them before and tried to use ones from GENCODE.

Hoohm commented 6 years ago

Do you have a mixed experiment? Mouse/human or only human? I'm using those DNA and gene sets GTF file

pwl commented 6 years ago

I tried it with the genome sequence GRCh38.p10 and it worked perfectly! Thank you!