Open akahles opened 2 years ago
Hi Andre,
Pardon the long wait. This is my first public repository on git and my first issue, and somehow I missed linking the notifications to my email. :) Welcome to the DICAST's git.
Can I ask whether you changed the file scripts/ASimulatoR_config.R? If so, could you share this file so that I can try to reproduce your bug exactly?
For now, I suspect that you've got max_genes = "NULL", which shows me a similar bug. Can I ask that you limit max_genes to something like 100 or 15000 and try running it again? If it still doesn't work, I'd also try removing all files matching src/ASimulatoR/in/*.rda.
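The two checks suggested here could be scripted roughly as follows. This is only a sketch: it assumes it is run from the DICAST root directory, and the function names show_max_genes and clear_cached_supersets are hypothetical helpers, not part of DICAST or ASimulatoR.

```shell
#!/bin/bash
# Hypothetical helpers (not part of DICAST or ASimulatoR).

# Print every line of the config file that mentions max_genes.
show_max_genes() {
    local config="${1:-scripts/ASimulatoR_config.R}"
    grep -n "max_genes" "$config" || echo "no max_genes setting found in $config"
}

# List and delete cached .rda files (such as the exon superset)
# that sit directly in the ASimulatoR input directory.
clear_cached_supersets() {
    local indir="${1:-src/ASimulatoR/in}"
    find "$indir" -maxdepth 1 -type f -name '*.rda' -print -delete
}
```

Calling show_max_genes and then clear_cached_supersets without arguments uses the paths mentioned in this thread; other paths can be passed as the first argument.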
Meanwhile, I'll reach out to the ASimulatoR authors about this bug.
Thanks for your inputs so far. I hope we can get ASimulatoR running for you soon. Amit
Hi Amit,
thanks for your reply and no worries about the delay. I had a look and I made no modifications to scripts/ASimulatoR_config.R. Also, I verified that the current setting in the file is max_genes = 100.
I tried your suggestion to remove any files matching src/ASimulatoR/in/*.rda, but the run failed again. Here is the log again. The only difference is that it re-created the superset this time.
rule run_asimulator_rule:
input: output/snakemake/log_pulled_base_os.txt
output: output/snakemake/log_ran_asimulator.txt
jobid: 5
resources: tmpdir=/var/folders/17/yzmf8ktd6m5c05vzf7k59y0c0000gn/T
./src/ASimulatoR/run_asimulator.sh: line 14: /opt/DICAST/scripts/mapping_config.sh: No such file or directory
ln: /opt/DICAST/src/ASimulatoR/in/1.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/10.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/11.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/12.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/13.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/14.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/15.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/16.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/17.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/18.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/19.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/2.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/20.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/21.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/22.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/3.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/4.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/5.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/6.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/7.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/8.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/9.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/MT.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/X.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/Y.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/Homo_sapiens.GRCh38.dna.primary_assembly.fa: File exists
ln: /opt/DICAST/src/ASimulatoR/in/Homo_sapiens.GRCh38.105.gtf: File exists
Loading required package: data.table
Loading required package: rtracklayer
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, basename, cbind, colnames,
dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unsplit, which, which.max, which.min
Loading required package: S4Vectors
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:data.table’:
first, second
The following object is masked from ‘package:base’:
expand.grid
Loading required package: IRanges
Attaching package: ‘IRanges’
The following object is masked from ‘package:data.table’:
shift
Loading required package: GenomeInfoDb
Loading required package: polyester
Loading required package: pbmcapply
found the following fasta files: 1.fa, 10.fa, 11.fa, 12.fa, 13.fa, 14.fa, 15.fa, 16.fa, 17.fa, 18.fa, 19.fa, 2.fa, 20.fa, 21.fa, 22.fa, 3.fa, 4.fa, 5.fa, 6.fa, 7.fa, 8.fa, 9.fa, Homo_sapiens.GRCh38.dna.primary_assembly.fa, MT.fa, placeholder.fa, X.fa, Y.fa
note that splice variants will only be constructed from chromosomes that have a corresponding fasta file
set data.table threads to 8
importing gtf/gff...
finished importing gtf/gff
creating superset...
finished creating superset
saving superset...
finished saving superset
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘seqnames’ for signature ‘"NULL"’
Calls: do.call ... lapply -> FUN -> <Anonymous> -> <Anonymous> -> <Anonymous>
In addition: Warning messages:
1: In .check_input_dir(input_dir) :
found more than one gtf/gff file in input directory. using /input/Homo_sapiens.GRCh38.105.gtf...
2: In mclapply(X, FUN, ..., mc.cores = mc.cores, mc.preschedule = mc.preschedule, :
scheduled cores 1, 12, 17, 18, 19, 20 did not deliver results, all values of the jobs will be affected
Execution halted
mv: rename /opt/DICAST/src/ASimulatoR/out/*gtf to input/ASimulatoR.gtf: No such file or directory
mv: rename /opt/DICAST/src/ASimulatoR/out/*gff3 to input/ASimulatoR.gff3: No such file or directory
ls: /opt/DICAST/src/ASimulatoR/out/*fastq: No such file or directory
Removing temporary output file output/snakemake/log_pulled_base_os.txt.
Please let me know if any other information is required from my end.
Lastly, one more question. Is the simulation data you used for the evaluations in your DICAST preprint publicly available?
Thanks and Cheers,
Andre
Hi Andre,
Sorry, we haven't made the data public because it is several hundred GB in size, but if I could ask for a little more of your patience, ASimulatoR will give you many more datasets to try DICAST with. Furthermore, the author of ASimulatoR will join us by next week if we haven't solved this by then.
Can I ask for the outputs of ls -lah src/ASimulatoR/in and cat scripts/config.sh?
Thanks in advance. Amit
Also, if it's faster, I would ask you to install ASimulatoR directly in your R environment: https://github.com/biomedbigdata/ASimulatoR
Unfortunately, I still haven't been able to replicate your error message. Funny, we were hoping that Docker on Linux and Mac would behave the same.
Hi Amit,
As requested, here is the output of ls -lah src/ASimulatoR/in:
(dicast-snakemake) akahles@host:/opt/DICAST$ ls -lah src/ASimulatoR/in
total 14959056
drwxrwxrwx 33 root wheel 1.0K Feb 4 11:10 .
drwxrwxrwx 18 root wheel 576B Jan 7 15:44 ..
-rw-r--r-- 2 akahles wheel 241M Jan 7 14:54 1.fa
-rw-r--r-- 2 akahles wheel 130M Jan 7 14:54 10.fa
-rw-r--r-- 2 akahles wheel 131M Jan 7 14:54 11.fa
-rw-r--r-- 2 akahles wheel 129M Jan 7 14:54 12.fa
-rw-r--r-- 2 akahles wheel 111M Jan 7 14:54 13.fa
-rw-r--r-- 2 akahles wheel 104M Jan 7 14:54 14.fa
-rw-r--r-- 2 akahles wheel 99M Jan 7 14:54 15.fa
-rw-r--r-- 2 akahles wheel 88M Jan 7 14:54 16.fa
-rw-r--r-- 2 akahles wheel 81M Jan 7 14:54 17.fa
-rw-r--r-- 2 akahles wheel 78M Jan 7 14:54 18.fa
-rw-r--r-- 2 akahles wheel 57M Jan 7 14:54 19.fa
-rw-r--r-- 2 akahles wheel 235M Jan 7 14:54 2.fa
-rw-r--r-- 2 akahles wheel 62M Jan 7 14:54 20.fa
-rw-r--r-- 2 akahles wheel 45M Jan 7 14:54 21.fa
-rw-r--r-- 2 akahles wheel 49M Jan 7 14:54 22.fa
-rw-r--r-- 2 akahles wheel 192M Jan 7 14:54 3.fa
-rw-r--r-- 2 akahles wheel 184M Jan 7 14:54 4.fa
-rw-r--r-- 2 akahles wheel 176M Jan 7 14:54 5.fa
-rw-r--r-- 2 akahles wheel 166M Jan 7 14:54 6.fa
-rw-r--r-- 2 akahles wheel 154M Jan 7 14:54 7.fa
-rw-r--r-- 2 akahles wheel 141M Jan 7 14:54 8.fa
-rw-r--r-- 2 akahles wheel 134M Jan 7 14:54 9.fa
-rw-r--r-- 2 akahles wheel 1.3G Jan 7 14:22 Homo_sapiens.GRCh38.105.gtf
-rw-r--r-- 1 akahles wheel 5.9M Feb 4 11:10 Homo_sapiens.GRCh38.105.gtf.exon_superset.rda
-rw-r--r-- 2 akahles wheel 2.9G Jan 7 14:21 Homo_sapiens.GRCh38.dna.primary_assembly.fa
-rw-r--r-- 2 akahles wheel 17K Jan 7 14:54 MT.fa
-rw-r--r-- 2 akahles wheel 151M Jan 7 14:54 X.fa
-rw-r--r-- 2 akahles wheel 55M Jan 7 14:54 Y.fa
-rwxr-xr-x 1 akahles wheel 0B Jan 7 14:58 placeholder.fa
-rwxrwxrwx 1 root wheel 0B Jan 7 09:24 placeholder.gtf
-rwxr-xr-x 1 akahles wheel 1.3K Feb 4 11:04 runASS.R
and
cat scripts/config.sh
############################
# Basic Parameters #
############################
ncores=4 #number of cores or threads the tool will use
workdir=/MOUNT #name of the base directory inside the Docker
outdir=$workdir/output/${tool:-unspecific}-output #name of the output directory; will be named after the specific tool that was used
read_length=100 #length of reads inside fastq files
differential=0
#############################
# Input Directories #
#############################
inputdir=$workdir/input
controlfolder=$inputdir/controldir #base directory for all needed input files (when no differential comparison, control inputs when differential AS Event Detection)
casefolder=$inputdir/casedir #base directory for only case files (for AS Event detection)
fastqdir=$controlfolder/fastqdir #directory for fastqfiles
controlbam=$controlfolder/bamdir
controlfastq=$controlfolder/fastqdir
bamdir=$controlfolder/bamdir #directory for bamfiles
samdir=$controlfolder/bamdir #directory for samfiles
fastadir=$inputdir #directory for fastafile (might vary for specific tools -> see mapping or as-specific config file)
gtfdir=$inputdir #directory for gtffile
gffdir=$inputdir #directory for gfffile
bowtie_fastadir=$inputdir/fasta_chromosomes/
############################
# Input Parameters #
############################
asimulator_gtf=Homo_sapiens.GRCh38.105.gtf #name of the GTF file used to generate simulated data within ASimulatoR R library.
fastaname=Homo_sapiens.GRCh38.dna.primary_assembly.fa #name of the genome reference file (fasta format), directory=$fastadir
gtfname=ASimulatoR.gtf #name of gtf reference file, directory=$gtffile; set to ASimulatoR_gtf.gtf, when ASimulator is true
gffname=ASimulatoR.gff3 #set to ASimulatoR_gff.gff3, when ASimulator is true
fasta=${fastadir}/$fastaname #fasta full path
gtf=${gtfdir}/$gtfname #gtf full path
gff=${gffdir}/$gffname #gff full path
#################################
# Mapping tool Parameters #
#################################
### used only in mapping tools ###
outname=$tool # basename of output file (will usually be prefixed with the fastq file name and suffixed with .sam)
#################
# Index #
#################
recompute_index=false #force index to be computed even if index with $indexname already exists
indexname=${fastaname}_index #basename of index (without eg. .1.bt2 for bowtie index)
star_index=$workdir/index/star_index #folder containing a star index built with the $gtf and $fasta files (used by: IRFinder, KisSplice, rMATS)
indexdir=$workdir/index/${tool:-unspecific}_index #directory of index
I can also give ASimulatoR a try directly. I assume, I could then still use the datasets within DICAST, as long as they are stored in the pre-defined structure?
Cheers, Andre
Hi Andre,
Thank you for your patience. I'd ask you to try a quick hack for me, so that I can tell whether this bug comes from something funny DICAST does or from something I should talk to the authors of ASimulatoR about.
Can you please make a quick bash script with the code below in your favorite new directory and see if it works? It runs ASimulatoR with the same configuration as the default run on DICAST and re-downloads the essential files needed for ASimulatoR.
This is the minimal code needed to run ASimulatoR independently.
#!/bin/bash
# Options go before the paths for portability across mkdir implementations.
mkdir -p ASimulator/{in,out}
# Download the human reference from Ensembl's FTP server.
link="http://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/dna/"
# Download one genome FASTA per chromosome.
for chromosomes in $(curl "$link" | cut -d ' ' -f2 | cut -d '"' -f3 | grep -v "nonchromosomal\|primary\|toplevel\|dna_\|alt" | grep Homo_sapiens | sed 's/...>//g' | tr -d '>'); do
  echo "Downloading $chromosomes chromosome"
  curl -o "ASimulator/in/$(echo "$chromosomes" | cut -d '.' -f 5-)" "$link$chromosomes"
done
# Download the GTF annotation.
curl -o ASimulator/in/Homo_sapiens.GRCh38.105.gtf.gz http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.gtf.gz
gzip -d ASimulator/in/Homo_sapiens.GRCh38.105.gtf.gz
gzip -d ASimulator/in/*fa.gz
# Modify this file after download to customize your dataset.
curl -o ASimulator/in/runASS.R https://raw.githubusercontent.com/CGAT-Group/DICAST/master/scripts/ASimulatoR_config.R
# Run ASimulatoR through the official Docker image.
docker run --rm --name "$USER-$RANDOM-dicast-$tool" --user "$(id -u):$(id -g)" -v "$(pwd)/ASimulator/in:/input" -v "$(pwd)/ASimulator/out:/output" biomedbigdata/asimulator
I'd copy the .fastq files from the newly created ASimulator/out/ to <DICAST-working-dir>/input/controldir/fastqdir/ for the rest of DICAST to evaluate these files.
Furthermore, I'd ask you to copy ASimulator/out/event_annotation.tsv from this run to <DICAST-working-dir>/src/ASimulatoR/out/event_annotation.tsv.
Everything else should work fine. I hope this gives you the files needed to run ASimulatoR and the rest of DICAST. Let me know how it goes, this should give me a lot more clues to try and narrow down the bug from my side. Thanking you in advance. Amit
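The two copy steps described above might be wrapped up as below. This is a sketch under assumptions: copy_asimulator_outputs is a hypothetical helper, the first argument is the output directory of the standalone run, the second is the DICAST working directory, and the target sub-directories are the ones named in this comment.

```shell
#!/bin/bash
# Hypothetical helper (not part of DICAST): copy the simulated reads and the
# event annotation from a standalone ASimulatoR run into a DICAST working dir.
copy_asimulator_outputs() {
    local out_dir="$1"      # output directory of the standalone run
    local dicast_dir="$2"   # DICAST working directory
    local fastq_dir="$dicast_dir/input/controldir/fastqdir"
    local annot_dir="$dicast_dir/src/ASimulatoR/out"

    mkdir -p "$fastq_dir" "$annot_dir" || return 1
    # cp fails here if the run produced no .fastq files.
    cp "$out_dir"/*.fastq "$fastq_dir"/ || return 1
    cp "$out_dir/event_annotation.tsv" "$annot_dir/event_annotation.tsv"
}
```

A call would look like copy_asimulator_outputs followed by the two directories; a non-zero exit status means at least one of the expected files was missing.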
Hi Amit,
thanks for posting the script. I gave it a try, but got the same error in the end. I will skip the output of the download section at the beginning and only paste the log that follows:
Loading required package: data.table
Loading required package: rtracklayer
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, basename, cbind, colnames,
dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unsplit, which, which.max, which.min
Loading required package: S4Vectors
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:data.table’:
first, second
The following object is masked from ‘package:base’:
expand.grid
Loading required package: IRanges
Attaching package: ‘IRanges’
The following object is masked from ‘package:data.table’:
shift
Loading required package: GenomeInfoDb
Loading required package: polyester
Loading required package: pbmcapply
found the following fasta files: 1.fa, 10.fa, 11.fa, 12.fa, 13.fa, 14.fa, 15.fa, 16.fa, 17.fa, 18.fa, 19.fa, 2.fa, 20.fa, 21.fa, 22.fa, 3.fa, 4.fa, 5.fa, 6.fa, 7.fa, 8.fa, 9.fa, MT.fa, X.fa, Y.fa
note that splice variants will only be constructed from chromosomes that have a corresponding fasta file
set data.table threads to 8
importing gtf/gff...
finished importing gtf/gff
creating superset...
finished creating superset
saving superset...
finished saving superset
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘seqnames’ for signature ‘"NULL"’
Calls: do.call ... lapply -> FUN -> <Anonymous> -> <Anonymous> -> <Anonymous>
In addition: Warning message:
In mclapply(X, FUN, ..., mc.cores = mc.cores, mc.preschedule = mc.preschedule, :
scheduled cores 9, 16, 17, 18, 19, 20 did not deliver results, all values of the jobs will be affected
Execution halted
I did not make any changes to the script that you posted above. Let me know if I should have.
Thanks and Cheers,
Andre
Thank you Andre, for your quick response. No, there were no changes to the previous script needed.
Are you running this in a Mac environment? Can we have the specs of your machine and OS? Do you have access to a Linux machine you could use? If not, we can try to figure out how to transfer the data we got from ASimulatoR, for transparency's sake.
Unfortunately this might be where we learn that ASimulatoR doesn't run on mac and maybe DICAST too :(. I'll wait to hear back from ASimulatoR's author.
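The requested machine and OS specs could be gathered with a short script such as the one below. It is a sketch only: print_machine_specs is a hypothetical helper, and it assumes either macOS (where sysctl -n hw.memsize reports memory in bytes) or Linux (where /proc/meminfo is available).

```shell
#!/bin/bash
# Hypothetical helper: print basic machine specs on macOS or Linux.
print_machine_specs() {
    echo "OS: $(uname -srm)"
    echo "Online CPU cores: $(getconf _NPROCESSORS_ONLN)"
    case "$(uname -s)" in
        Darwin) echo "Memory (bytes): $(sysctl -n hw.memsize)" ;;
        Linux)  grep MemTotal /proc/meminfo ;;
        *)      echo "Memory: unknown on this platform" ;;
    esac
    if command -v docker >/dev/null 2>&1; then
        docker --version
    else
        echo "docker not found on PATH"
    fi
}

print_machine_specs
```

The core count reported here is the one seen by the host shell; a Docker container may be given fewer cores and less memory than the host has.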
Hi Amit,
I am running on a Mac with the following setup:
Let me know if you need any more details.
Happy to try it on a Linux machine. I will let you know how it goes.
Just out of curiosity, would the setup also run with Singularity instead of Docker?
Thanks and Cheers,
Andre
Thanks Andre,
This is perfect. We do plan to support Singularity soon, but unfortunately this is still future work. We wanted to start with Docker and then port the Docker images to Singularity images. Stay tuned to this repo for further news.
Thanks for your support so far. Amit
Hi Andre,
Finally, I found some time to look at this issue, and the only thing I can identify is that somehow the exon_superset file gets corrupted. You have helped us a lot already; might I ask you to try using the attached superset instead of the one you generated? You'll have to rename it to Homo_sapiens.GRCh38.105.gtf.exon_superset.rda because GitHub doesn't allow that file extension.
Homo_sapiens.GRCh38.105.gtf.exon_superset.txt
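Putting the attached superset in place could look like the sketch below. install_superset is a hypothetical helper; the default input directory is the one used earlier in this thread and is an assumption for other setups. Note that it overwrites an existing superset of the same name, so keep a copy of the old file elsewhere if it is still needed for debugging.

```shell
#!/bin/bash
# Hypothetical helper: copy the downloaded attachment into the ASimulatoR
# input directory under the .rda name mentioned above.
install_superset() {
    local downloaded="$1"                   # path to the downloaded .txt file
    local indir="${2:-src/ASimulatoR/in}"   # ASimulatoR input directory
    local target="$indir/Homo_sapiens.GRCh38.105.gtf.exon_superset.rda"

    [ -f "$downloaded" ] || { echo "not found: $downloaded" >&2; return 1; }
    cp "$downloaded" "$target"              # overwrites an existing superset
}
```

For the standalone script posted earlier, the second argument would be ASimulator/in instead of the default.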
Thank you in advance. Quirin
Hi Quirin, I received the following error message when I tried to run ASimulatoR.
finished loading superset
assign variants to supersets... create splicing variants and annotation. This may take a while... finished creating splicing variants and annotation
exporting gtf for read simulation...
Error in file(file, ifelse(append, "a", "w")) :
cannot open the connection
Calls: do.call ...
Hi @NormanRog, is this related to this issue? If so, did you use the script posted above? If not, please open another issue, in the ASimulatoR repository, with more information on how you called the function. I will probably only find time to look at this at the end of the week, so I would appreciate it if you could provide as much information as possible.
Best, Quirin
After looking into this more deeply, this looks like a memory issue caused by too many processes being spawned. ncores seems to be set to 20 although only 8 cores were available. The log says
set data.table threads to 8
but
scheduled cores 9, 16, 17, 18, 19, 20 did not deliver results, all values of the jobs will be affected
Forking 20 processes probably took too much memory, which led to some of them being killed and not delivering results in either step: creating the superset and/or the variants (a corrupted superset will lead to the error unable to find an inherited method for function ‘seqnames’ for signature ‘"NULL"’).
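A saved run log can be scanned for the two symptoms described here. The sketch below is illustrative: check_asimulator_log is a hypothetical helper, and the search patterns are simply copied from the log lines quoted in this thread.

```shell
#!/bin/bash
# Hypothetical helper: look for lost worker processes and the resulting
# method-dispatch error in a saved ASimulatoR log.
# Returns 0 if neither symptom is found, 1 otherwise.
check_asimulator_log() {
    local log="$1"
    local status=0
    if grep -q "did not deliver results" "$log"; then
        echo "worker processes were lost (possibly killed for lack of memory):"
        grep -n "scheduled cores" "$log"
        status=1
    fi
    if grep -q "unable to find an inherited method for function" "$log"; then
        echo "method dispatch error found; the superset may be corrupted"
        status=1
    fi
    return "$status"
}
```

If the first message appears, the numbers after "scheduled cores" show which workers were lost, which hints at how many processes were forked.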
I added a few lines to ASimulatoR for better documentation and limited ncores to the number of available cores. This might still not be enough. I would recommend monitoring the memory usage while simulating.
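The idea of capping the worker count and watching memory can be illustrated in shell. This is not the change made in ASimulatoR itself, only a sketch of the same idea: cap_ncores and watch_docker_memory are hypothetical helpers, and the core count comes from getconf on the machine where the script runs.

```shell
#!/bin/bash
# Hypothetical helper: clamp a requested worker count to the number of
# online cores reported by getconf.
cap_ncores() {
    local requested="$1"
    local available
    available="$(getconf _NPROCESSORS_ONLN)"
    if [ "$requested" -gt "$available" ]; then
        echo "$available"
    else
        echo "$requested"
    fi
}

# Hypothetical helper: print a snapshot of container resource usage every
# few seconds until interrupted with Ctrl-C.
watch_docker_memory() {
    local interval="${1:-10}"
    while sleep "$interval"; do
        docker stats --no-stream
    done
}
```

For example, cap_ncores 20 prints the number of online cores on a machine that has fewer than 20, and 20 otherwise. Even a capped worker count can exhaust memory, so running watch_docker_memory in a second terminal during the simulation may still be worthwhile.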
Hope this helps. Best, Quirin
Dear DICAST team,
After some setup issues (outlined in #1), I was able to successfully start the GUI. I selected to simulate reads with ASimulatoR and am currently stuck at the following error message:
As a setup procedure, I used the script initializing-dicast.sh to populate the input structure and unzipped its contents. Then I started the GUI, selected the input directory, acknowledged possible overwrites and ticked the box for "Do you want to run ASimulatoR?". Let me know if you need additional info from my side.
Cheers,
Andre