fw262 / TAR-scRNA-seq

scRNA-seq analysis beyond gene annotations using transcriptionally active regions (TARs) generated from sequence alignment data
GNU General Public License v3.0

Error reading TAR_reads.bed file in calcHMMrefFlat #3

Closed dkeitley closed 4 years ago

dkeitley commented 4 years ago

Hi,

I'm running into the following error when running the pipeline on my single-cell dataset:

rule calcHMMrefFlat:
    input: TAR_reads.bed.gz
    output: TAR_reads.bed.gz.noDir.refFlat.refFlat, TAR_reads.bed.gz.withDir.refFlat.refFlat
    jobid: 0

Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  no lines available in input
Calls: read.delim -> read.table

The TAR_reads.bed.gz file is empty, which suggests that the problem lies in the calcHMMbed step.

I've set the following parameters in the config.yaml file, which, except for expectedCells, should be the same as the defaults.

GLOBAL:
    allowed_mismatch: 10 # outFilterMismatchNmax parameter for STAR alignment
    BC_range: # barcode range
        first: 1
        last: 16
    UMI_range: # unique molecular identifier range
        first: 17
        last: 28 # 28 for v3, up to 26 for v2

expectedCells: 25000
MERGEBP: 500
THRESH: 10000000
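
For context, the BC_range/UMI_range settings above slice read 1 of a 10x v3 run like this (an illustrative sketch only; the read sequence is invented, and positions are 1-based and inclusive, matching the config):

```shell
# Hypothetical 28 bp read 1 sequence (invented for illustration)
read1="AAACCCAAGAAACACTTTGGTTAGCGAT"
bc=$(echo "$read1" | cut -c1-16)    # BC_range: first=1, last=16
umi=$(echo "$read1" | cut -c17-28)  # UMI_range: first=17, last=28 (v3)
echo "BC=$bc UMI=$umi"
# prints: BC=AAACCCAAGAAACACT UMI=TTGGTTAGCGAT
```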

Any suggestions as to what might be causing the error?

Many thanks,

Dan

fw262 commented 4 years ago

Hi Dan,

Thank you for your message, and I'm sorry you got that error.

Can you share the contents of the "SingleCellHMM_Run_combined_bam_HMM_features.log" file that the pipeline generates? It should be located in the same directory as your Snakefile. Also, did you try running the tool with the sample dataset provided first?

Best, Michael

dkeitley commented 4 years ago

Hi Michael,

I seem to be getting the same error when running on the sample dataset. Looking through the HMM features log file, it appears that part of the problem was that I didn't have bedtools installed. I've now modified the SingleCell_HMM_MW.bash script to reference my bedtools install, but I'm still getting a "chr*_HMM.bed file not found" error.
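
One way to catch this class of failure up front is to verify the external tools are on PATH before launching the pipeline (a minimal sketch; the tool list is an assumption based on this thread, not the repo's documented requirements):

```shell
# check_tool reports whether a required executable can be found on PATH
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: MISSING"
  fi
}

# gzip and sort appear throughout the HMM script's post-processing;
# bedtools is needed by the calcHMMbed step discussed above
check_tool gzip
check_tool sort
check_tool bedtools
```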

Here's the HMM features log file:

Path to SingleCellHMM.R   TAR-scRNA-seq/scripts
INPUT_BAM                 combined_bam.bam
temp folder               combined_bam_HMM_features
number Of thread          2
minimum coverage        0
thresholded at 1 in 10000000 reads

Reads spanning over splicing junction will join HMM blocks
To avoid that, split reads into small blocks before input to groHMM
Spliting and sorting reads...

Start to run groHMM in each individual chromosome...
TAR-scRNA-seq/combined_bam_HMM_features
TAR-scRNA-seq/combined_bam_HMM_features
TAR-scRNA-seq/combined_bam_HMM_features
TAR-scRNA-seq/combined_bam_HMM_features
TAR-scRNA-seq/combined_bam_HMM_features
TAR-scRNA-seq/combined_bam_HMM_features
TAR-scRNA-seq/combined_bam_HMM_features
TAR-scRNA-seq/combined_bam_HMM_features

Merging HMM blocks within 500bp...
sort: cannot read: chr*_HMM.bed: No such file or directory
mkdir: cannot create directory ‘toremove’: File exists

Calculating the coverage...
ERROR: Database file /dev/fd/63 contains chromosome 1, but the query file does not.
       Please rerun with the -g option for a genome file.
       See documentation for details.

Filtering the HMM blocks by coverage...

#### Please examine if major chromosomes are all present in the final TAR_reads.bed.gz file ####

Link the final TAR_reads.bed.gz file to the working directory

Move intermediate files to  combined_bam_HMM_features/toremove ...

combined_bam_HMM_features/toremove can be deleted if no error message in SingleCellHMM_Run log file and 
all major chromosomes are present in the final TAR_reads.bed.gz file

gzip: combined_bam_merge500.sorted.bed_count.gz already exists; not overwritten
gzip: combined_bam_split.sorted.bed.gz already has .gz suffix -- unchanged
gzip: toremove is a directory -- ignored
gzip: TAR_reads.bed.gz already has .gz suffix -- unchanged
gzip: combined_bam_merge500.sorted.bed.gz already has .gz suffix -- unchanged
gzip: combined_bam_merge500.sorted.bed_count.gz already has .gz suffix -- unchanged
gzip: combined_bam_merge500.sorted.bed.gz already exists;   not overwritten

done!
gzip: chr*.bed.log.gz already has .gz suffix -- unchanged
gzip: chr*_HMM.bed_plus_merge500.gz already has .gz suffix -- unchanged
gzip: chr*_HMM.bed_minus.gz already has .gz suffix -- unchanged
gzip: chr*_HMM.bed.sorted.bed.gz already has .gz suffix -- unchanged
gzip: chr14.bed.log.gz already exists;  not overwritten
gzip: chr*_HMM.bed_plus.gz already has .gz suffix -- unchanged
gzip: chr*_HMM.bed_minus_merge500.gz already has .gz suffix -- unchanged
gzip: chr*_HMM.bed_plus_merge500.gz already exists; not overwritten
gzip: chr*_HMM.bed.sorted.bed.gz already exists;    not overwritten
gzip: chr*_HMM.bed_minus_merge500.gz already exists;    not overwritten
gzip: chr*_HMM.bed_minus.gz already exists; not overwritten
gzip: chr16.bed.log.gz already exists;  not overwritten
gzip: chr1.bed.log: No such file or directory
gzip: chr2.bed.log: No such file or directory
gzip: chr3.bed.log: No such file or directory
gzip: chr4.bed.log: No such file or directory
gzip: chr5.bed.log: No such file or directory
gzip: chr*_HMM.bed_plus.gz already exists;  not overwritten
gzip: chr*.bed.log.gz already exists;   not overwritten
gzip: chrMT.bed.log: No such file or directory

Thanks,

Dan

fw262 commented 4 years ago

Hi Dan,

The issue you are getting is likely due to the R package "rtracklayer" not being installed. I have made an update to the file scripts/SingleCellHMM.R that will automatically install rtracklayer and groHMM.
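
For readers who prefer to install the packages up front rather than rely on the auto-install, the standard Bioconductor pattern looks roughly like this (a sketch, not the exact code added to scripts/SingleCellHMM.R; it requires network access):

```r
# Install BiocManager if absent, then the two packages the HMM step needs
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c("rtracklayer", "groHMM"))
```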

Please try re-running the pipeline on the test dataset with the updated scripts/SingleCellHMM.R file. The test dataset should take under 30 minutes to run. Once that works, you should be able to run the pipeline on your real dataset.

Thank you for bringing this issue to my attention! If you have any other issues, please let me know.

Best, Michael

dkeitley commented 4 years ago

Hi Michael,

Installing rtracklayer and groHMM has fixed the issue. I initially ran into a number of problems installing rtracklayer in my conda setup due to its XML dependency, but I managed to fix that by modifying the LD_LIBRARY_PATH (see here for anyone else having trouble with this).
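
For anyone hitting the same XML/libxml2 problem in a conda environment, the workaround amounts to something like this (a hedged sketch; the fallback env path is hypothetical, and CONDA_PREFIX is only set after `conda activate`):

```shell
# Prepend the active conda env's lib directory so R's XML package can find libxml2.
# Falls back to a made-up env path if no environment is active.
export LD_LIBRARY_PATH="${CONDA_PREFIX:-$HOME/miniconda3/envs/tar-scrna}/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
```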

Thanks a lot for your help!

Dan