YangLab / SCAPTURE

Other
18 stars 4 forks source link

PAScall does not complete - issue in bedtools? #2

Open hwessels opened 3 years ago

hwessels commented 3 years ago

Hi,

I am running PAScall on my Cellranger3.0 output using ensembl v97 annotation (all annotation is ensemble / no "chr" pasted in front of the chromosome names). PAScall runs until it starts using bedtools.

environment:

module purge
module load scapture/1.0.1 # with python3 virtual environment
module load subread/2.0.3 # contains feature counts
module load kentutils/302.1 # contains genePredToBed and gtfToGenePred
module load bedtools/2.29.0

command:

scapture -m PAScall -a SCAPTURE_annotation -g genome.fa -b sample.bam  -l 98 -o name -p 20 --polyaDB SupTab_KnownPASs_fourDBs.txt &> PAScall.log

it generates the files name.genetype.count.txt and name.CallPasPerGene.sh. All subsequent files are empty. Can you point me at what is happening?

log file excerpt:

scapture path: /nfs/sw/scapture/scapture-1.0.1/
DeepPASS model file: /nfs/sw/scapture/scapture-1.0.1//DeepPASS/best_model.h5
scapture module: PAScall
Output prefix: name
prefix of annotation files from annotation module: SCAPTURE_annotation
BAM file: sample.bam
Fragment length: 98
GENOME file: genome.fa
Peak width: 400
OverlapRatio: 0.5
threads: 20
poly(a) database file: SupTab_KnownPASs_fourDBs.txt
scapture PAScall: create command line. Wed Sep  1 12:15:39 EDT 2021
scapture PAScall: create command line done. Wed Sep  1 13:33:41 EDT 2021
scapture PAScall: peak calling. Wed Sep  1 13:33:41 EDT 2021
depth: invalid option -- 'd'
open: No such file or directory
/nfs/sw/scapture/scapture-1.0.1/scapture_callpeak: line 120: 142881 Done                    bedtools intersect -a $PREFIX"."$GeneName".bam" -b $PREFIX"."$GeneName".bed" -split -f 0.95 -u -bed
     142882 Broken pipe             | bedtools bedtobam -i - -bed12 -g $Chromsize
     142883 Segmentation fault      | samtools depth -d 0 - > $PREFIX"."$GeneName".cov"

One uncertainty I have is the identity of the SupTab_KnownPASs_fourDBs.txt file. I did not find it in you GitHub repository. Therefore, I generated it from a supplementary file from the paper as a 4 column tab delimited file with header. Not sure if this looks like it is supposed to look.

Location    Strand  Source  Overlapped number
chr1:16441-16442    -   PolyADB3    3
chr1:16442-16443    -   PolyA-Seq   3
chr1:16442-16452    -   PolyASite   3
liguowei-CAS commented 3 years ago

Hi hwessels,

In the posted log file, an error from samtools occurred: depth: invalid option -- 'd' Please reply with the version of bedtools and samtools you used, we will check this out.

For specifically, SCAPTURE uses bedtools v2.26.0 and samtools 1.9. And "-d" option for samtools was: -d/-m <int> maximum coverage depth [8000]. If 0, depth is set to the maximum integer value, effectively removing any depth limit.

Otherwise, SupTab_KnownPASs_fourDBs.txt for human hg38 genome could be found at https://drive.google.com/file/d/14ZQo9FWmMDRvIcR6mqXVMfIdQRKsaEL3/view?usp=sharing. Sorry for the unclear description, we now update in README.

rpghjs commented 3 years ago

Hi, I'm curious if you provide mm10's knownPASs data sets. I have once clicked on the link "mm10".But there's nothing in it. I am looking forward your reply.

liguowei-CAS commented 3 years ago

Hi,

We now have the latest release v1.1.0, supporting both poly(A) site prediction for both human and mouse ! The parameter is specifically added at PAScall step.

And the softlink of mm10 knownPASs set was fixed, thanks.

hwessels commented 3 years ago

Hi, Thanks for your quick replies. When did you release 1.1? I pulled in on August 31st being v1.0.1. Besides that, I have updated the software to run it with the suggested versions, and updated the pA db file.

module purge
module load scapture/1.0.1 # with python3 virtual environment
module load subread/2.0.3 # contains feature counts
module load kentutils/302.1 # contains genePredToBed and gtfToGenePred
module load bedtools/2.26.0 
module load samtools/1.9

the -d samtools error is fixed. However, it still doesn't create useful output files (empty files).

One issue seems to by that python couldn't import numpy.

Besides that, here the start and end of the log file:

scapture path: /nfs/sw/scapture/scapture-1.0.1/
DeepPASS model file: /nfs/sw/scapture/scapture-1.0.1//DeepPASS/best_model.h5
scapture module: PAScall
Output prefix: CPA
prefix of annotation files from annotation module: SCAPTURE_annotation
BAM file: sample.bam
Fragment length: 98
GENOME file: genome.fa
Peak width: 400
OverlapRatio: 0.5
threads: 20
poly(a) database file: SupTab_KnownPASs_fourDBs.txt
scapture PAScall: create command line. Thu Sep  2 10:35:08 EDT 2021
scapture PAScall: create command line done. Thu Sep  2 11:46:29 EDT 2021
scapture PAScall: peak calling. Thu Sep  2 11:46:29 EDT 2021
Differing number of BED fields encountered at line: 129.  Exiting...
Differing number of BED fields encountered at line: 129.  Exiting...
Differing number of BED fields encountered at line: 135.  Exiting...
Differing number of BED fields encountered at line: 1205.  Exiting...
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
Differing number of BED fields encountered at line: 129.  Exiting...
Differing number of BED fields encountered at line: 7436.  Exiting...
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
It looks as though you have less than 3 columns at line: 1385.  Are you sure your files are tab-delimited?
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
...
...
...
sort: cannot read: CPA_tmp/CPA.exonic.peak.isoform.normality.aggregate.tmp: No such file or directory
rm: cannot remove ‘CPA_tmp/CPA.exonic.peak.isoform.normality.aggregate.tmp’: No such file or directory
sort: cannot read: CPA_tmp/CPA.exonic.peaks.tmp: No such file or directory
rm: cannot remove ‘CPA_tmp/CPA.exonic.peaks.tmp’: No such file or directory
sort: cannot read: CPA_tmp/CPA.intronic.rawpeak.tmp: No such file or directory
rm: cannot remove ‘CPA_tmp/CPA.intronic.rawpeak.tmp’: No such file or directory
sort: cannot read: CPA_tmp/CPA.intronic.peak.aggregate.tmp: No such file or directory
rm: cannot remove ‘CPA_tmp/CPA.intronic.peak.aggregate.tmp’: No such file or directory
sort: cannot read: CPA_tmp/CPA.intronic.peaks.tmp: No such file or directory
rm: cannot remove ‘CPA_tmp/CPA.intronic.peaks.tmp’: No such file or directory
scapture PAScall: peak calling done. Thu Sep  2 15:15:26 EDT 2021
scapture PAScall: peak annotating. Thu Sep  2 15:15:26 EDT 2021
scapture PAScall: peak annotating done. Thu Sep  2 15:15:30 EDT 2021
scapture PAScall: PAS evaluating. Thu Sep  2 15:15:30 EDT 2021
Error: malformed BED entry at line 1. Start was greater than end. Exiting.
cat: CPA.exonic.peaks.DeepPASS.predictout/Predict_Result.txt: No such file or directory
***** ERROR: too many digits/characters for integer conversion in string . Exiting...
Error: malformed BED entry at line 1. Start was greater than end. Exiting.
cat: CPA.intronic.peaks.DeepPASS.predictout/Predict_Result.txt: No such file or directory
***** ERROR: too many digits/characters for integer conversion in string . Exiting...
Error: malformed BED entry at line 1. Start was greater than end. Exiting.
cat: CPA.3primeExtended.peaks.DeepPASS.predictout/Predict_Result.txt: No such file or directory
***** ERROR: too many digits/characters for integer conversion in string . Exiting...
scapture PAScall: output files --  CPA.exonic.peaks.evaluated.bed CPA.intronic.peaks.evaluated.bed CPA.3primeExtended.peaks.evaluated.bed
scapture PAScall: PAS evaluating done. Thu Sep  2 15:15:30 EDT 2021
rpghjs commented 3 years ago

你好,

我们现在拥有最新版本 v1.1.0,支持人类和小鼠的 poly(A) 位点预测! 该参数是在 PAScall 步骤中专门添加的。

并且mm10 knownPAS设置的软链接已修复,谢谢。

Thanks for your reply

liguowei-CAS commented 3 years ago

Hi,

Version 1.1 was released on Sep 2nd. If you only focus on the human species, 1.1 is not necessary.

The error was caused by the uncorrected format of the input BED file. Please create and offer us some subset files that also could cause the same error, we will help to figure it out.

hwessels commented 3 years ago

Hi again, yes, I cannot tell which bed file is the problematic one. here I am sharing some chr20 files with you including 10000 reads on chr20. https://drive.google.com/drive/folders/1XRH_HG14XEixzQvZbF11GRwb2ezwZjVl?usp=sharing

thanks for looking into that

liguowei-CAS commented 3 years ago

Hi hwessels, I recived your files and created annotation with given " chr20_genes.gtf". It worked out fine in this step, . See below:

scapture -m annotation -o SCAPTURE_annotation -g genome.fa --gtf chr20_genes.gtf --cs Chr20_SCAPTURE_annotation.chromsize --extend 2000 &> annotation.log cat annotation.log

scapture path: /picb/rnomics4/SCAPTURE-main/ scapture module: annotation Output prefix: SCAPTURE_annotation GENCODE GTF: chr20_genes.gtf GENOME file: genome.fa chrom size file: Chr20_SCAPTURE_annotation.chromsize Generate annotation file: SCAPTURE_annotation.genetype.gtf SCAPTURE_annotation.genetype.txt SCAPTURE_annotation.genetype.bed SCAPTURE_annotation.genetype.extended.bed SCAPTURE_annotation.chromsize output annotation file prefix: SCAPTURE_annotation

Next, for the PAScall step, no error occurred in poly(A) site calling process. Thus, I couldn't repeat the situation you mentioned above. There was an error in poly(A) site evaluating process, which was caused by the empty intronic input file and can be ignored in this case.

scapture -m PAScall -a SCAPTURE_annotation -g genome.fa -b chr20_reads.bam -l 100 -o output -p 1 --polyaDB SupTab_KnownPASs_fourDBs.txt &> PAScall.log cat PAScall.log

scapture path: /picb/rnomics4/SCAPTURE-main/ DeepPASS model file: /picb/rnomics4/SCAPTURE-main//DeepPASS/best_model.h5 scapture module: PAScall Output prefix: output prefix of annotation files from annotation module: SCAPTURE_annotation BAM file: chr20_reads.bam Fragment length: 100 GENOME file: genome.fa Peak width: 400 OverlapRatio: 0.5 threads: 1 poly(a) database file: SupTab_KnownPASs_fourDBs.txt scapture PAScall: create command line. Mon Sep 6 10:54:56 CST 2021 scapture PAScall: create command line done. Mon Sep 6 10:54:59 CST 2021 scapture PAScall: peak calling. Mon Sep 6 10:54:59 CST 2021 scapture PAScall: peak calling done. Mon Sep 6 10:55:50 CST 2021 scapture PAScall: peak annotating. Mon Sep 6 10:55:50 CST 2021 scapture PAScall: peak annotating done. Mon Sep 6 10:55:50 CST 2021 scapture PAScall: PAS evaluating. Mon Sep 6 10:55:50 CST 2021 Error: malformed BED entry at line 1. Start was greater than end. Exiting. cat: output.intronic.peaks.DeepPASS.predictout/Predict_Result.txt: No such file or directory ***** ERROR: too many digits/characters for integer conversion in string . Exiting... scapture PAScall: output files -- output.exonic.peaks.evaluated.bed output.intronic.peaks.evaluated.bed output.3primeExtended.peaks.evaluated.bed scapture PAScall: PAS evaluating done. Mon Sep 6 10:56:17 CST 2021

I recommend reinstall environment with conda env create -f SCAPTURE_env.yaml from yaml file. Otherwise, you could share the full files with google drive.

rpghjs commented 2 years ago

Hi, When i running the step of PASCALL,I saw the ployaDB in optional.So So I didn't choose that option.Now i have a problem that I have a huge amount of data to sift through right now (with 70,000 PASs).How can i use the polyaDB to filter the data via myself instead of running PAScall.

liguowei-CAS commented 2 years ago

See #5