fw262 / TAR-scRNA-seq

scRNA-seq analysis beyond gene annotations using transcriptionally active regions (TARs) generated from sequence alignment data

Issues with pwd in bash script #2

Closed 14zac2 closed 4 years ago

14zac2 commented 4 years ago

Hi!

I'm having a little trouble getting your pipeline to run on my computer and I was wondering if you could tweak a couple things to make your code a bit more robust just in case somebody runs into similar issues (I also have yet to successfully make it through the entire pipeline, just so you know! Still fixing issues).

A significant yet trivial-seeming issue is that I do most of my work in Dropbox, which puts spaces in its folder names (e.g. "Dropbox (XXX Lab)" is a directory). This trips up line 18 of your script SingleCellHMM_MW.bash: ${PL:=${CURDIR}/scripts}. The variable PL becomes incomplete when the path contains spaces. Perhaps it would be possible to make a version of the script that can deal with spaces in the directory name? I realize the best solution would be to have no spaces in the Dropbox directory, but alas... Unfortunately this spacing issue causes the program to crash whenever pwd or the variable $PL is used.
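Something like quoting the expansions might be all it takes — here's a rough sketch of what I mean (not your actual script, just the pattern):

```bash
#!/bin/bash
# Quoting keeps a path like "Dropbox (XXX Lab)" together as a single argument
CURDIR="$(pwd)"
PL="${CURDIR}/scripts"

echo "Using scripts in: ${PL}"
bash "${PL}/SingleCellHMM_MW.bash"   # quoted everywhere the variable is used
```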

Another small thing is that the pipeline crashed once because I had not installed Bedtools. Do you think it would be possible to add it to your list of required tools to have installed before running the pipeline?

Considering the length of time this pipeline takes to run, I have also been running snakemake --rerun-incomplete without deleting the intermediate files. I am not entirely clear on whether this is messing up anything in the pipeline - are there checkpoints where you would recommend deleting specific files or directories before attempting to rerun from where the pipeline crashed? I realize there may be a long and complicated answer to this question.

Thanks so much for your help! Zoe

fw262 commented 4 years ago

Hi Zoe,

Thank you for the suggestion! I did not think that having spaces in the file folder name would be an issue but I will incorporate that possibility in the next update.

Running 'snakemake --rerun-incomplete' should not mess up anything. Do you know where the program crashed? Every time you run the 'snakemake' command, the program generates a log file stored in './.snakemake/log' that shows which step the pipeline failed at.

The snakemake pipeline also lets you run up to a specific rule with 'snakemake -R {RULE_NAME}'. For example, you can run the pipeline up to the STAR-aligned bam file with 'snakemake -R STAR_align'. You can also edit the Snakefile and change the final output you want by modifying the 'rule all' output file (line 48).
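For example, something along these lines (the --cores value is just an example, adjust to your machine):

```bash
# Force a rerun starting from the STAR alignment rule
snakemake --cores 8 -R STAR_align

# The most recent log from a run
ls -t .snakemake/log/ | head -n 1
```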

Let me know if you have any more questions!

Best, Michael

14zac2 commented 4 years ago

Hi Michael,

Thanks so much for your response! This clarifies things. I can certainly let you know where the program is crashing! I thought it might be a problem with my gtf file, with starting partway through the pipeline, or with having a directory with spaces, but I edited my gtf file and reran the entire thing in a different directory this morning, and the program crashed with the same issues.

The pipeline currently fails during rule calcHMMbed where I get the following errors:

The following is where my shell seems to be interpreting the first few lines of the SingleCellHMM_MW.bash script as commands, when I'm pretty sure those numbers are meant to be assigned to variables:

scripts/SingleCellHMM_MW.bash: line 10: 20: command not found
scripts/SingleCellHMM_MW.bash: line 12: 500: command not found
scripts/SingleCellHMM_MW.bash: line 13: 10000000: command not found
scripts/SingleCellHMM_MW.bash: line 18: /home/zclarke/Documents/TAR-scRNA-seq/scripts: Is a directory

A few lines later, I get this error:

Reads spanning over splicing junction will join HMM blocks
To avoid that, split reads into small blocks before input to groHMM
Spliting and sorting reads...
awk: cannot open "chrWCK01_AAF20200214_F8-ctg1022.bed" for output (Too many open files)

After groHMM runs on each individual chromosome, I get this error:

Merging HMM blocks within 500bp...
sort: cannot read: 'chr*_HMM.bed': No such file or directory

Calculating the coverage...
ERROR: Database file /dev/fd/63 contains chromosome WCK01_AAF20200214_F8-ctg1, but the query file does not. Please rerun with the -g option for a genome file. See documentation for details.

And then finally:

gzip: combined_bam_split.sorted.bed.gz already has .gz suffix -- unchanged
gzip: TAR_reads.bed.gz already has .gz suffix -- unchanged
gzip: toremove is a directory -- ignored

Then for the next rule, rule calcHMMrefFlat, I get this error, though it's probably a consequence of the errors in the previous rule:

Error in read.table(file = file, header = header, sep = sep, quote = quote, : no lines available in input
Calls: read.delim -> read.table
Execution halted

I have also attached the log file in case it is helpful. Do you know where things might be going wrong here?

Thanks so much! Zoe 2020-08-24T042855.361211.snakemake.log

fw262 commented 4 years ago

Hi Zoe,

Thank you for providing more detail about your problem. From the first error message, 'awk: cannot open "chrWCK01_AAF20200214_F8-ctg1022.bed" for output (Too many open files)', it seems that the computer you are running this on does not have enough RAM.

How much memory do you have on the computer or cluster that you are using to run this pipeline? How big is the input bam file "combined_bam.bam"?
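If it helps, you can check both quickly with standard commands:

```bash
free -h                    # total and available memory
ls -lh combined_bam.bam    # size of the input bam file
```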

Thanks, Michael

14zac2 commented 4 years ago

Hi Michael,

Thanks for your quick response! "combined_bam.bam" is ~15G and I have 62G of RAM with 15G free right now. I also have the option of running this on any of the ComputeCanada clusters, but they often have long wait times for job submissions, so I was trying to see if I could get this to work locally.

Thanks, Zoe

fw262 commented 4 years ago

Hi Zoe,

15G of RAM won't be enough to process a 15G bam file. I would recommend running the pipeline on a cluster with more RAM.

If you really want to run it locally, I would try down-sampling the bam file with something like 'samtools view -s 0.1 -b combined_bam.bam > combined_bam_small.bam' (writing to a new file so the original isn't overwritten). This randomly sub-samples 10% of the reads, so you end up with a smaller ~1.5GB bam file. With this, you can at least see if the pipeline works.
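For reference, the integer part of -s sets the random seed, and the output name below is just a placeholder:

```bash
# Keep roughly 10% of reads; the integer part of -s is the random seed (42 here)
samtools view -b -s 42.1 combined_bam.bam > combined_bam_small.bam
samtools quickcheck combined_bam_small.bam && echo "subsampled bam looks OK"
```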

Thanks, Michael

14zac2 commented 4 years ago

Hi Michael,

No problem, I'll try it on a cluster! Do you have a rough idea of how much RAM the job will require? I can also figure this out through trial and error.

Do you think solving the memory issue will also solve the issue of the chromosome not appearing in the query file?

Thanks again! Zoe

fw262 commented 4 years ago

Hi Zoe,

I'm not sure of the exact number, but you may need up to 100GB of memory for a 15GB bam file; it only uses that much during that particular step in the pipeline. I think the chromosome issue is related to the previous memory issue as well.

There is also a small test set in the testData_small folder that you can run to make sure everything works.

Thanks, Michael

14zac2 commented 4 years ago

Fantastic, thanks Michael! I appreciate all of your help on this. I'll transfer my data to the cluster and let you know if I encounter further issues.

Cheers, Zoe

14zac2 commented 4 years ago

Hi Michael,

I just tried running the pipeline on the cluster, and I'm getting the same errors. I'm concerned about the following error, which is still the first to occur before everything crashes in rule calcHMMbed:

scripts/SingleCellHMM_MW.bash: line 10: 30: command not found
scripts/SingleCellHMM_MW.bash: line 12: 500: command not found
scripts/SingleCellHMM_MW.bash: line 13: 10000000: command not found
scripts/SingleCellHMM_MW.bash: line 18: /scratch/zocla/TAR-scRNA-seq/scripts: Is a directory

It just seems like the script is getting interpreted incorrectly, but I'm not sure why there would be a syntax error?

Thanks again for your insight! Zoe

EDIT: Sorry, the errors are slightly different. I am no longer getting the error 'awk: cannot open "chrWCK01_AAF20200214_F8-ctg1022.bed" for output (Too many open files)'; instead, at the same position, I am getting 'sort: write failed: combined_bam_HMM_features/sortIvY4R4: Input/output error'. The rest of the errors appear to be the same.

fw262 commented 4 years ago

Hi Zoe,

Sorry this didn't work for you. The first 4 "errors" are actually warnings. I've updated the scripts/SingleCellHMM_MW.bash file to address them.
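For reference, those warnings come from the default-value lines being expanded on their own, so bash tries to run the value (e.g. "20") as a command. Prefixing them with the ':' no-op builtin avoids that — roughly like this (the variable names here are made up, not the script's real ones):

```bash
# ':' is a no-op, so the parameter expansion assigns the default without executing it
: "${CORES:=20}"
: "${MERGEBP:=500}"
: "${MINCOV:=10000000}"
```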

The error that seems to kill the pipeline is "sort: write failed: combined_bam_HMM_features/sortIvY4R4: Input/output error". Do you mind sharing the "SingleCellHMM_Run_combined_bam_HMM_features.log" that should be in the same folder as your Snakefile?

Thanks, Michael

14zac2 commented 4 years ago

Hi Michael - no problem, and of course! Here it is: SingleCellHMM_Run_combined_bam_HMM_features.log

Yes, I believe I have narrowed the issue down to this part of a piped command: LC_ALL=C sort -k1,1V -k2,2n --parallel=30. I don't think it's outputting anything. The bedtools command seems to work. Still working on it!

Thanks, Zoe

fw262 commented 4 years ago

Hi Zoe,

Yes, it looks like it's failing at the bedtools bamtobed -i ${INPUT_BAM} -split | LC_ALL=C sort -k1,1V -k2,2n --parallel=30 | awk '{print $0}' | gzip > ${TMPDIR}/${PREFIX}_split.sorted.bed.gz step. The "write failed" suggests that there may be a space or access issue when writing the output.
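If it turns out to be temp space or memory for sort, pointing it at a roomier temp directory and capping its buffer can help — roughly like this (not the pipeline's exact code; the temp path and sizes are placeholders):

```bash
bedtools bamtobed -i "${INPUT_BAM}" -split \
  | LC_ALL=C sort -k1,1V -k2,2n --parallel=8 -S 20G -T /path/with/space \
  | gzip > "${TMPDIR}/${PREFIX}_split.sorted.bed.gz"
```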

Did you try running with the small test dataset?

Thanks, Michael

14zac2 commented 4 years ago

Hi Michael,

I found that my problem was memory, again! The sort alone needed 70G for the combined bam file. I was just surprised this was an issue because the pipeline continued after this point rather than failing. I'm working around it by running the sort on its own, commenting out that line, and rerunning the rule. Just rerunning this now.

I just tried running the pipeline with the test data and I am encountering an issue here, too, in the same rule. It appears as though the chr*_HMM.bed files aren't being created, although combined_bam_split.sorted.bed.gz looks like it was created okay. I have the log file here: SingleCellHMM_Run_combined_bam_HMM_features.log

I'm sorry this is causing so many issues, and really appreciate your help!!

Thanks so much, Zoe

UPDATE: I am now experiencing the same error for both datasets.

14zac2 commented 4 years ago

Hi Michael,

I'm so dumb! The problem was that I didn't have certain R packages installed - an easy fix. I spotted it easily by looking at the log files under combined_bam_HMM_features/toremove/. The test data is working now, and hopefully mine will too! Running it now.
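In case it helps anyone else, installing a missing Bioconductor package from the shell looks roughly like this (groHMM is just an example here; the logs under toremove/ name whichever package is actually missing):

```bash
Rscript -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("groHMM")'
```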

Thanks again for all your help, Zoe

fw262 commented 4 years ago

That's great! Let me know if you run into other issues.

Michael

14zac2 commented 4 years ago

Thanks Michael! Everything is indeed working great!

I hope you don't mind if I ask a couple of questions for clarification. I got this error from both the test data and my data:

Calculating the coverage...
ERROR: Database file /dev/fd/63 contains chromosome WCK01_AAF20200214_F8-ctg1000, but the query file does not. Please rerun with the -g option for a genome file. See documentation for details.

I assume this is fine and just means that a certain chromosome isn't detected in the fastq files?

Also, when genes appear multiple times in the TAR matrices, that just indicates that the gene is transcribed from multiple locations in the genome, right?

Thanks again! Zoe

fw262 commented 4 years ago

Hi Zoe,

Did you check TAR_reads.bed.gz to see if it is populated? You shouldn't encounter that database file error if TARs are generated.
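A quick way to check is something like:

```bash
zcat TAR_reads.bed.gz | head     # should show BED lines
zcat TAR_reads.bed.gz | wc -l    # should be well above zero
```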

In the TAR matrix, TARs may appear multiple times with different strandedness. It is rare that 2 TARs would have the exact same coordinates, but if they do, they should have different strandedness.

Michael

14zac2 commented 4 years ago

Hi Michael,

Thanks for your insight! The TAR_reads.bed.gz file is populated, but I did get the error. My suspicion is that it might be a consequence of the earlier error and then rerunning the rule. It looks like a lot of files weren't overwritten when new files were created during the rerun (snakemake -R calcHMMbed). Here is the log file so you can see what I mean: SingleCellHMM_Run_combined_bam_HMM_features.log

Do you think this may have produced an incomplete TAR matrix? I thought the error might just be because not all contigs or chromosomes had transcribed reads present in the fasta file.

Thanks! Zoe

fw262 commented 4 years ago

Hi Zoe,

It looks like the test dataset worked out. You can check the expression matrix in the output folder results_out/day4_0.25m/ to make sure there are values.
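Something along these lines is enough for a sanity check (adjust the file names to whatever your run produced):

```bash
ls -lh results_out/day4_0.25m/            # output files should exist and be non-empty
zcat results_out/day4_0.25m/*.gz | head   # peek at the matrix values, if gzipped
```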

Michael

14zac2 commented 4 years ago

Fantastic, thanks Michael!

Zoe