fw262 / TAR-scRNA-seq

scRNA-seq analysis beyond gene annotations using transcriptionally active regions (TARs) generated from sequence alignment data
GNU General Public License v3.0
9 stars 7 forks source link

fail to create scTAR_cellranger env #10

Open adonatiucsd opened 2 years ago

adonatiucsd commented 2 years ago

Hi, I am trying to get the fromcellranger pipeline to work on MacOS Catalina 10.15.7 When I run conda env create -f scTAR_cellranger.yml (after modifying the yml file prefix to /home/miniconda3/envs/scTAR_cellranger) I get the following error message:

Collecting package metadata (repodata.json): done Solving environment: failed

ResolvePackageNotFound:


any idea about what is wrong? I tried to run snakemake without creating the scTAR_cellranger environment but I get a bunch of errors for example after doing conda activate snakemake snakemake -j I get: The flag 'directory' used in rule getMatsSteps is only valid for outputs, not inputs. Building DAG of jobs... MissingInputException in line 99 of /Users/perrylabmac/TAR-scRNA-seq/from_cellranger/Snakefile: Missing input files for rule convertToRefFlat: /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.gtf

thanks!

Antoine

fw262 commented 2 years ago

Hi Antoine,

The ResolvePackageNotFound is a known issue if you're trying to create the environment on Mac using an environment.yml file created on a different OS, in this case Linux. I suggest running this pipeline either in Linux or try running without creating the scTAR_cellranger environment.

It looks like the error you're getting is related to your /Users/perrylabmac/Musca_ref_genome_Cellranger folder. Can you make sure this folder contains genes/genes.gtf?

Best, Michael

adonatiucsd commented 2 years ago

Hi Michael,

It turns out the genes.gtf was zipped; after unzipping and rerunning the snakemake I got this error:

Error in rule convertToRefFlat: jobid: 9 output: /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat shell:

            /Users/perrylabmac/miniconda3/pkgs/ucsc-gtftogenepred-377-h516baf0_3/bin/gtfToGenePred -genePredExt -geneNameAsName2 /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.gtf refFlat.tmp
            paste <(cut -f 12 refFlat.tmp) <(cut -f 1-10 refFlat.tmp) > /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat
            rm refFlat.tmp

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

thanks a lot for your help,

Antoine

fw262 commented 2 years ago

Can you share the error message you get when you run the following commands on the command line? The snakemake log file does not provide the error message info.

/Users/perrylabmac/miniconda3/pkgs/ucsc-gtftogenepred-377-h516baf0_3/bin/gtfToGenePred -genePredExt -geneNameAsName2 /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.gtf refFlat.tmp
            paste <(cut -f 12 refFlat.tmp) <(cut -f 1-10 refFlat.tmp) > /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat
            rm refFlat.tmp

Thanks, Michael

adonatiucsd commented 2 years ago

I get this (snakemake) Perrys-MacBook-Pro:log perrylabmac$ /Users/perrylabmac/miniconda3/pkgs/ucsc-gtftogenepred-377-h516baf0_3/bin/gtfToGenePred -genePredExt -geneNameAsName2 /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.gtf refFlat.tmp dyld: Library not loaded: @rpath/libssl.1.1.dylib Referenced from: /Users/perrylabmac/miniconda3/pkgs/ucsc-gtftogenepred-377-h516baf0_3/bin/gtfToGenePred Reason: image not found Abort trap: 6 (snakemake) Perrys-MacBook-Pro:log perrylabmac$ paste <(cut -f 12 refFlat.tmp) <(cut -f 1-10 refFlat.tmp) > /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat cut: cut: refFlat.tmp: No such file or directory refFlat.tmp: No such file or directory (snakemake) Perrys-MacBook-Pro:log perrylabmac$ rm refFlat.tmp rm: refFlat.tmp: No such file or directory

fw262 commented 2 years ago

Hmm, seems like an issue with the libssl library. Try conda installing with conda install libssh2 and re-runing.

Michael

adonatiucsd commented 2 years ago

So I installed libssh2 and rerun, here is the log file: Select jobs to execute...

[Fri Oct 8 10:07:32 2021] rule calcHMMbed: input: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/outs/possorted_genome_bam.bam output: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz jobid: 8 wildcards: DATADIR=/Users/perrylabmac/M1_Musca, sample=Lib_10X_Musca_larva_2019_02_25_M1_1sttry threads: 3 resources: tmpdir=/var/folders/kb/1djtgry52557wtjq8j15n_pm0000gn/T

[Fri Oct 8 12:18:55 2021] Finished job 8. 1 of 11 steps (9%) done Select jobs to execute...

[Fri Oct 8 12:18:55 2021] rule calcHMMrefFlat: input: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz, /Users/perrylabmac/Musca_ref_genome_Ce$ output: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz.withDir.genes.refFlat jobid: 7 wildcards: DATADIR=/Users/perrylabmac/M1_Musca, sample=Lib_10X_Musca_larva_2019_02_25_M1_1sttry resources: tmpdir=/var/folders/kb/1djtgry52557wtjq8j15n_pm0000gn/T

[Fri Oct 8 12:18:55 2021] Error in rule calcHMMrefFlat: jobid: 7 output: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz.withDir.genes.refFlat shell:

            Rscript scripts/generate_refFlat_script_both.R /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat /Users/per$

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: /Users/perrylabmac/TAR-scRNA-seq/from_cellranger/.snakemake/log/2021-10-08T100732.520897.snakemake.log

and what I get during execution:

(snakemake) Perrys-MacBook-Pro:from_cellranger perrylabmac$ snakemake -j --rerun-incomplete The flag 'directory' used in rule getMatsSteps is only valid for outputs, not inputs. Building DAG of jobs... Using shell: /usr/local/bin/bash Provided cores: 8 Rules claiming more threads will be scaled down. Job stats: job count min threads max threads


HMM_refFlat_to_gtf_WITHDIR 1 1 1 all 1 1 1 calcHMMbed 1 3 3 calcHMMrefFlat 1 1 1 extract_HMM_expression_withDir 1 1 1 getDiffFeatures 1 1 1 getDiffRegionsToBlast 1 1 1 getDiffSeqsToBlastFa 1 1 1 labelDiffuTARs 1 1 1 ruleBlast 1 3 3 stage3_withDir 1 1 1 total 11 1 3

Select jobs to execute...

[Fri Oct 8 10:07:32 2021] rule calcHMMbed: input: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/outs/possorted_genome_bam.bam output: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz jobid: 8 wildcards: DATADIR=/Users/perrylabmac/M1_Musca, sample=Lib_10X_Musca_larva_2019_02_25_M1_1sttry threads: 3 resources: tmpdir=/var/folders/kb/1djtgry52557wtjq8j15n_pm0000gn/T

Number of aligned reads is 594024319 mkdir: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features: File exists tee: SingleCellHMMRun/Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features.log: No such file or directory Path to SingleCellHMM.R /Users/perrylabmac/TAR-scRNA-seq/from_cellranger/scripts INPUT_BAM /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/outs/possorted_genome_bam.bam cellranger count folder /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR tmp folder /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features number Of thread 3 minimum coverage 59 thresholded at 1 in 10000000 reads

Reads spanning over splicing junction will join HMM blocks To avoid that, split reads into small blocks before input to groHMM Spliting and sorting reads... awk: syntax error at source line 1 context is {print $0 >> >>> "chr"$ <<< 1".bed"} awk: illegal statement at source line 1 zcat: can't stat: possorted_genome_bam_split.sorted.bed.gz (possorted_genome_bam_split.sorted.bed.gz.Z): No such file or directory find: illegal option -- n usage: find [-H | -L | -P] [-EXdsx] [-f path] path ... [expression] find [-H | -L | -P] [-EXdsx] -f path [path ...] [expression]

Start to run groHMM on each individual chromosome... chr*.bed

Merging HMM blocks within 500bp... sort: No such file or directory mkdir: toremove: File exists

Calculating the coverage... zcat: can't stat: possorted_genome_bam_split.sorted.bed.gz (possorted_genome_bam_split.sorted.bed.gz.Z): No such file or directory

Filtering the HMM blocks by coverage...

Please examine if major chromosomes are all present in the final TAR_reads.bed.gz file

zcat: can't stat: TAR_reads.bed.gz (TAR_reads.bed.gz.Z): No such file or directory

Link the final TAR_reads.bed.gz file to the working directory

Move intermediate files to /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features/toremove ...

/Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features/toremove can be deleted if no error message in SingleCellHMM_Run log file and all major chromosomes are present in the final TAR_reads.bed.gz file

gzip: toremove is a directory

done! gzip: possorted_genome_bam_split.sorted.bed.gz already has .gz suffix -- unchanged gzip: TAR_reads.bed.gz already has .gz suffix -- unchanged [Fri Oct 8 12:18:55 2021] Finished job 8. 1 of 11 steps (9%) done Select jobs to execute...

[Fri Oct 8 12:18:55 2021] rule calcHMMrefFlat: input: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz, /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat output: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz.withDir.genes.refFlat jobid: 7 wildcards: DATADIR=/Users/perrylabmac/M1_Musca, sample=Lib_10X_Musca_larva_2019_02_25_M1_1sttry resources: tmpdir=/var/folders/kb/1djtgry52557wtjq8j15n_pm0000gn/T

Error in read.table(file = file, header = header, sep = sep, quote = quote, : no lines available in input Calls: read.delim -> read.table Execution halted [Fri Oct 8 12:18:55 2021] Error in rule calcHMMrefFlat: jobid: 7 output: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz.withDir.genes.refFlat shell:

    Rscript scripts/generate_refFlat_script_both.R /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: /Users/perrylabmac/TAR-scRNA-seq/from_cellranger/.snakemake/log/2021-10-08T100732.520897.snakemake.log

thanks!

Antoine

fw262 commented 2 years ago

Hi Antoine,

I think this issue may again be related to your MacOS system. The awk syntax seem to be a little different depending on the OS. I propose 2 temporary solutions you can try.

  1. Try using gawk instead of awk (command: alias awk=gawk).
  2. Go into the file scripts/SingleCellHMM_MW.bash and change the line bedtools bamtobed -i ${INPUT_BAM} -split |LC_ALL=C sort -k1,1V -k2,2n --parallel=30| awk '{print $0}' | gzip > ${TMPDIR}/${PREFIX}_split.sorted.bed.gz into bedtools bamtobed -i ${INPUT_BAM} -split |LC_ALL=C sort -k1,1V -k2,2n --parallel=30| awk '{print ($0)}' | gzip > ${TMPDIR}/${PREFIX}_split.sorted.bed.gz where there is a parenthesis added to the awk print command.

Let me know if that fixes it.

Michael

adonatiucsd commented 2 years ago

Hi Michael,

I tried both, I still get this error:

The flag 'directory' used in rule getMatsSteps is only valid for outputs, not inputs. Building DAG of jobs... Using shell: /usr/local/bin/bash Provided cores: 8 Rules claiming more threads will be scaled down. Job stats: job count min threads max threads


HMM_refFlat_to_gtf_WITHDIR 1 1 1 all 1 1 1 calcHMMrefFlat 1 1 1 extract_HMM_expression_withDir 1 1 1 getDiffFeatures 1 1 1 getDiffRegionsToBlast 1 1 1 getDiffSeqsToBlastFa 1 1 1 labelDiffuTARs 1 1 1 ruleBlast 1 3 3 stage3_withDir 1 1 1 total 10 1 3

Select jobs to execute...

[Tue Oct 12 09:22:09 2021] rule calcHMMrefFlat: input: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz, /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat output: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz.withDir.genes.refFlat jobid: 7 wildcards: DATADIR=/Users/perrylabmac/M1_Musca, sample=Lib_10X_Musca_larva_2019_02_25_M1_1sttry resources: tmpdir=/var/folders/kb/1djtgry52557wtjq8j15n_pm0000gn/T

Error in read.table(file = file, header = header, sep = sep, quote = quote, : no lines available in input Calls: read.delim -> read.table Execution halted [Tue Oct 12 09:22:09 2021] Error in rule calcHMMrefFlat: jobid: 7 output: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz.withDir.genes.refFlat shell:

    Rscript scripts/generate_refFlat_script_both.R /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: /Users/perrylabmac/TAR-scRNA-seq/from_cellranger/.snakemake/log/2021-10-12T092208.761573.snakemake.log

Thanks!

Antoine

fw262 commented 2 years ago

Can you share the content of your /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR directory with a ls -hlt command? It looks like you have an empty file somewhere along the TAR discover step.

Thanks, Michael

adonatiucsd commented 2 years ago

Hi!

(base) Perrys-MacBook-Pro:TAR perrylabmac$ ls -hlt total 0 lrwxr-xr-x 1 perrylabmac staff 123B Oct 12 15:23 TAR_reads.bed.gz -> /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features/TAR_reads.bed.gz drwxr-xr-x 7 perrylabmac staff 224B Oct 12 15:23 possorted_genome_bam_HMM_features

and the content of the inner folder:

(base) Perrys-MacBook-Pro:possorted_genome_bam_HMM_features perrylabmac$ ls -hlt total 11993240 drwxr-xr-x 2 perrylabmac staff 64B Oct 12 15:23 toremove -rw-r--r-- 1 perrylabmac staff 20B Oct 12 15:23 TAR_reads.bed.gz -rw-r--r-- 1 perrylabmac staff 67B Oct 12 15:23 possorted_genome_bam_merge500.sorted.bed_count.gz -rw-r--r-- 1 perrylabmac staff 61B Oct 12 15:23 possorted_genome_bam_merge500.sorted.bed.gz -rw-r--r-- 1 perrylabmac staff 5.7G Oct 12 15:23 possorted_genome_bam_split.sorted.bed.gz

thanks,

Antoine

fw262 commented 2 years ago

Can you make sure you have the necessary R packages, listed below, installed?

BiocManager rtracklayer groHMM Seurat, version >= 4.0 data.table dplyr stringr

Can you also share what is inside the toremove folder with ls -lht?

Thanks, michael

adonatiucsd commented 2 years ago

Hi,

So I already had these R packages installed via Rstudio but I reinstalled them within miniconda3 just to make sure; I get the same error message as before. The toremove folder is empty; nothing shows up when I type ls -lht

thanks,

Antoine

fw262 commented 2 years ago

Can you share the SingleCellHMM_Run*.log file generated after implementing the awk fix? I would also recommend running the pipeline on a Linux system if possible.

Michael

adonatiucsd commented 2 years ago

ok I will try to run this on Linux. Where can I find the SingleCellHMM log file?

fw262 commented 2 years ago

I made a change to scripts/SingleCellHMM_MW.bash to fix a bug with the TAR discovery log file. If you update this file and re-run your pipeline, you should find the log file in /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features/SingleCellHMM_Run.log.

Michael

adonatiucsd commented 2 years ago

Here it is:

Path to SingleCellHMM.R /Users/perrylabmac/TAR-scRNA-seq/from_cellranger/scripts INPUT_BAM /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/outs/possorted_genome_bam.bam cellranger count folder /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR tmp folder /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features number Of thread 10 minimum coverage 59 thresholded at 1 in 10000000 reads

Reads spanning over splicing junction will join HMM blocks To avoid that, split reads into small blocks before input to groHMM Spliting and sorting reads... zcat: can't stat: possorted_genome_bam_split.sorted.bed.gz (possorted_genome_bam_split.sorted.bed.gz.Z): No such file or directory awk: syntax error at source line 1 context is {print $0 >> >>> "chr"$ <<< 1".bed"} awk: illegal statement at source line 1 find: illegal option -- n usage: find [-H | -L | -P] [-EXdsx] [-f path] path ... [expression] find [-H | -L | -P] [-EXdsx] -f path [path ...] [expression]

Start to run groHMM on each individual chromosome... chr*.bed

Merging HMM blocks within 500bp... sort: No such file or directory

Calculating the coverage... zcat: can't stat: possorted_genome_bam_split.sorted.bed.gz (possorted_genome_bam_split.sorted.bed.gz.Z): No such file or directory

Filtering the HMM blocks by coverage...

Please examine if major chromosomes are all present in the final TAR_reads.bed.gz file

zcat: can't stat: TAR_reads.bed.gz (TAR_reads.bed.gz.Z): No such file or directory

Link the final TAR_reads.bed.gz file to the working directory

Move intermediate files to /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features/toremove ...

/Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features/toremove can be deleted if no error message in SingleCellHMM_Run log file and all major chromosomes are present in the final TAR_reads.bed.gz file

adonatiucsd commented 2 years ago

Hi,

So I still haven't managed to get it to work on MacOS but I was able to run the pipeline on Linux. However when loading the TAR bed file on IGV I noticed that many scaffolds did not have anything in the TAR track; and from the groHMM log file it looks like the program did not go through all the 20487 scaffold of the Housefly genome:

Path to SingleCellHMM.R /home/adonati/TAR-scRNA-seq/from_cellranger/scripts INPUT_BAM /home/adonati/Desktop/M1/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/outs/possorted_genome_bam.bam cellranger count folder /home/adonati/Desktop/M1/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR tmp folder /home/adonati/Desktop/M1/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features number Of thread 6 minimum coverage 59 thresholded at 1 in 10000000 reads

Reads spanning over splicing junction will join HMM blocks To avoid that, split reads into small blocks before input to groHMM Spliting and sorting reads... Finished splitting input bam. awk: cannot open "chrNW_004756140.1.bed" for output (Too many open files)

Start to run groHMM on each individual chromosome... chrNW_004754939.1.bed chrNW_004754940.1.bed chrNW_004754941.1.bed chrNW_004754943.1.bed chrNW_004754965.1.bed chrNW_004755053.1.bed chrNW_004755054.1.bed chrNW_004755111.1.bed chrNW_004755164.1.bed chrNW_004755198.1.bed chrNW_004755220.1.bed chrNW_004755231.1.bed chrNW_004755318.1.bed chrNW_004755333.1.bed chrNW_004755352.1.bed chrNW_004755353.1.bed chrNW_004755386.1.bed chrNW_004755398.1.bed chrNW_004755497.1.bed chrNW_004755531.1.bed chrNW_004755586.1.bed chrNW_004755622.1.bed chrNW_004755631.1.bed chrNW_004755653.1.bed chrNW_004755684.1.bed chrNW_004755784.1.bed chrNW_004755797.1.bed chrNW_004755819.1.bed chrNW_004755838.1.bed chrNW_004755853.1.bed chrNW_004755919.1.bed chrNW_004755924.1.bed chrNW_004755942.1.bed chrNW_004755953.1.bed chrNW_004756008.1.bed chrNW_004756052.1.bed chrNW_004756054.1.bed chrNW_004756076.1.bed chrNW_004756109.1.bed chrNW_004756120.1.bed

Merging HMM blocks within 500bp...

Calculating the coverage...

Filtering the HMM blocks by coverage...

Please examine if major chromosomes are all present in the final TAR_reads.bed.gz file

NW_004754939.1 NW_004754940.1 NW_004754941.1 NW_004754943.1 NW_004754965.1 NW_004755053.1 NW_004755054.1 NW_004755164.1 NW_004755198.1 NW_004755220.1 NW_004755231.1 NW_004755353.1 NW_004755386.1 NW_004755398.1 NW_004755497.1 NW_004755531.1 NW_004755586.1 NW_004755631.1 NW_004755653.1 NW_004755797.1 NW_004755819.1 NW_004755838.1 NW_004755853.1 NW_004755919.1 NW_004755924.1 NW_004755942.1 NW_004755953.1 NW_004756008.1 NW_004756052.1 NW_004756054.1 NW_004756076.1 NW_004756109.1

Link the final TAR_reads.bed.gz file to the working directory

Move intermediate files to /home/adonati/Desktop/M1/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features/toremove ...

/home/adonati/Desktop/M1/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features/toremove can be deleted if no error message in SingleCellHMM_Run log file and all major chromosomes are present in the final TAR_reads.bed.gz file

gzip: possorted_genome_bam_split.sorted.bed.gz already has .gz suffix -- unchanged gzip: TAR_reads.bed.gz already has .gz suffix -- unchanged

Looking at the TAR_reads. bed it seems many scaffold are absent indeed. Any idea why the program skipped so many scaffolds?

Thanks!

Antoine

fw262 commented 2 years ago

Hi Antoine,

I'm glad the pipeline worked better on Linux. I updated scripts/SingleCellHMM_MW.bash so please try re-running with this new script. With over 20k scaffolds, it's overloading memory so I modified that script to close bed files that are generated. This may take a little longer to run but should work.

Best, Michael

adonatiucsd commented 2 years ago

Hi Michael,

I am trying to get the pipeline to work on UCSD supercomputer Expanse, I installed miniconda3 and all the software required except R, which I can't figure out how to install without sudo (which I can't use on Expanse); do you think the pipeline could work if I install R via conda?

Thanks!

Antoine

fw262 commented 2 years ago

A conda installed R should work fine, as long as you have all the required R packages as well. Let me know if it works out.

Michael

adonatiucsd commented 2 years ago

Hi Michael,

Sorry about the stupid question, but am I supposed to activate the scTAR_cellranger environment before running snakemake? I do have R and installed all the required libraries but within the scTAR_cellranger environement. The problem is that to run snakemake I need to use "conda activate snakemake", which deactivates scTAR_cellranger... Thanks a lot!

Antoine

adonatiucsd commented 2 years ago

Hi Michael,

I get this error when I run the pipeline after installing R with conda:

wildcards: CR_REF=/expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/Musca_ref_genome resources: tmpdir=/tmp

[Sat Oct 23 18:45:20 2021] Finished job 9. 1 of 11 steps (9%) done Select jobs to execute...

[Sat Oct 23 18:45:20 2021] rule calcHMMrefFlat: input: /expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/M1/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz, /expanse/lustre/scratch/adon$ output: /expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/M1/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz.withDir.genes.refFlat jobid: 7 wildcards: DATADIR=/expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/M1, sample=Lib_10X_Musca_larva_2019_02_25_M1_1sttry resources: tmpdir=/tmp

/usr/bin/bash: line 1: Rscript: command not found [Sat Oct 23 18:45:20 2021] Error in rule calcHMMrefFlat: jobid: 7 output: /expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/M1/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz.withDir.genes.refFlat shell:

        Rscript scripts/generate_refFlat_script_both.R /expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/Musca_ref_genome/genes/genes.refFlat /expa$

(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: /home/adonati2/TAR-scRNA-seq/from_cellranger/.snakemake/log/2021-10-23T184509.175761.snakemake.log

it looks like the Rscript command is not working from within the snakemake pipeline even if I do have Rscript in miniconda3/bin...

thanks!

Antoine

fw262 commented 2 years ago

Hi Antoine,

It looks like Rscript is not available within the snakemake environment. You can try simply replacing Rscript command within the rule calcHMMrefFlat of Snakefile to explicitly call Rscript in miniconda3/bin.

In regards to the snakemake environment, it may be worthwhile to install snakemake via pip so you can activate the scTAR_cellranger environment without conflicting with the snakemake environment.

Best, Michael

adonatiucsd commented 2 years ago

Hi Michael,

I had a problem with R because for some reason conda installed R 3.2 I managed to install R 4.1 with mamba and then I was able to run the pipeline It ran for a full 24h on the computer before stopping before completion (max time reached): it created tones of .bed files (looks like one per scaffold) in /TAR/possorted_genome_bam_HMM_features along with "possorted_genome_bam_split.sorted.bed.gz" and "SingleCellHMM_Run.log" which reads like:

Path to SingleCellHMM.R /home/adonati2/TAR-scRNA-seq/from_cellranger/scripts INPUT_BAM /expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/M1/Lib_10X_Musca_larva_2019_0225$ cellranger count folder /expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/M1/Lib_10X_Musca_larva_2019_0225$ tmp folder /expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/M1/Lib_10X_Musca_larva_2019_0225$ number Of thread 24 minimum coverage 59 thresholded at 1 in 10000000 reads

Reads spanning over splicing junction will join HMM blocks To avoid that, split reads into small blocks before input to groHMM Spliting and sorting reads... Finished splitting input bam.

I was wondering if the pipeline is really supposed to generate and keep all these .bed files? Is it normal that the pipeline took so long even running on Expanse (I ran it on just one Node, which is two 64-core AMD EPYC 7742 processors and contain 256 GB of DDR4 memory)? Thanks!

Antoine

fw262 commented 2 years ago

Hi Antoine,

Yes, the pipeline is suppose to generate those bed files. The groHMM algorithm is run on individual chromosomes/scaffolds. There is a limitation with the tool if you have a very high number of scaffolds (over 20,000 in your case). Are you interested in the expression across all scaffolds? Perhaps you can filter for the major scaffolds where you'd expect the most expression.

Best, Michael

adonatiucsd commented 2 years ago

Hi Michael,

I guess it would be nice to have the TARs for all scaffold; here I have many scaffolds indeed but most of them are small, only 35 of them are over 1Mb, and the total genome size is only 750Mb; will the pipeline run slower with a more fragmented genome or is the speed only dependent on genome size?

fw262 commented 2 years ago

Hi Antoine,

The pipeline would run slower with more fragmented genome. It wouldn't take as long if you ran the pipeline on, for example, the human or mouse genome.

Best, Michael

adonatiucsd commented 2 years ago

Hi Michael,

Are all these bedfiles deleted at the end of the pipeline or are they all merged together and zipped? thanks

Antoine

fw262 commented 2 years ago

Hi Antoine,

These bed files are zipped, not merged, and moved to the possorted_genome_bam_HMM_features/toremove folder which you can remove manually.

Best, Michael