COMBINE-lab / SalmonTools

Useful tools for working with Salmon output
BSD 3-Clause "New" or "Revised" License
36 stars 20 forks source link

generateDecoyTranscriptome.sh gets ABORTED #7

Closed guidohooiveld closed 5 years ago

guidohooiveld commented 5 years ago

Hi, I got stuck while trying to generate a hybrid fasta decoy file. The script is aborted after the file 'reference.masked.genome.fa' is generated. Since the script is not 'killed' or 'segmentation faulted' (i.e. two issues reported here before) I decided to open a new issue. Any suggestions would be appreciated. Thanks!

Source files: ftp://ftp.ensemblgenomes.org/pub/metazoa/release-44/fasta/lottia_gigantea

My code:

[guidoh@localhost Work]$ bash generateDecoyTranscriptome.sh \
>  -j 1 \
>  -g ./ENSEMBL/Lottia_gigantea.Lotgi1.dna.toplevel.fa.gz \
>  -t ./ENSEMBL/Lottia_gigantea.Lotgi1.cdna.all.fa.gz \
>  -a ./ENSEMBL/Lottia_gigantea.Lotgi1.44.gff3.gz \
>  -o gentrome.files
****************
*** getDecoy ***
****************
-j <Concurrency level> = 1
-g <Genome fasta> = /Work/ENSEMBL/Lottia_gigantea.Lotgi1.dna.toplevel.fa.gz
-t <Transcriptome fasta> = /Work/ENSEMBL/Lottia_gigantea.Lotgi1.cdna.all.fa.gz
-a <Annotation GTF file> = /Work/ENSEMBL/Lottia_gigantea.Lotgi1.44.gff3.gz
-o <Output files Path> = gentrome.files
[1/10] Extracting exonic features from the gtf
[2/10] Masking the genome fasta
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
generateDecoyTranscriptome.sh: line 101: 23362 Aborted                 (core dumped) $bedtools maskfasta -fi $genomefile -bed exons.bed -fo reference.masked.genome.fa

***************
*** ABORTED ***
***************

An error occurred. Exiting...
[guidoh@localhost Work]$ 
k3yavi commented 5 years ago

Hi @guidohooiveld ,

can you try GTF instead of gff and check if it works ?

k3yavi commented 5 years ago

We have generated the decoys for some of the organisms here.

guidohooiveld commented 5 years ago

Wow, thanks for your fast reply!

Hi @guidohooiveld ,

can you try GTF instead of gff and check if it works ?

I tried, but it didn't work either.... I got the exact same error as when using the gff file...??

We have generated the decoys for some of the organisms here.

THANKS! Just generated an index using this file.

k3yavi commented 5 years ago

Glad to hear that, if the index is online then it must have run through the gtf file, I am not sure what went wrong on your local run, may be you can check if enough disk or ram is available ?

guidohooiveld commented 5 years ago

Well, I don't know how to quickly check how much memory (RAM) was used, but there is sufficient disk space available (>1.5TB). Moreover, aborting goes very fast (see below).

FWIW I also have appended my system info below.

***************
*** ABORTED ***
***************

An error occurred. Exiting...
Command exited with non-zero status 1
        Command being timed: "bash generateDecoyTranscriptome.sh -j 1 -g ./ENSEMBL/Lottia_gigantea.Lotgi1.dna.toplevel.fa.gz -t ./ENSEMBL/Lottia_gigantea.Lotgi1.cdna.all.fa.gz -a ./ENSEMBL/Lottia_gigantea.Lotgi1.44.gtf.gz -o gentrome.files"
        User time (seconds): 0.03
        System time (seconds): 0.81
        Percent of CPU this job got: 67%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.26
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 5004
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1374
        Voluntary context switches: 39
        Involuntary context switches: 9
        Swaps: 0
        File system inputs: 0
        File system outputs: 2048
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 1
[guidoh@localhost Work]$ 

[guidoh@localhost Work]$ cat /etc/os-release
NAME=Fedora
VERSION="30 (Thirty)"
ID=fedora
VERSION_ID=30
VERSION_CODENAME=""
PLATFORM_ID="platform:f30"
PRETTY_NAME="Fedora 30 (Thirty)"
ANSI_COLOR="0;34"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:30"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f30/system-administrators-guide/"
SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=30
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=30
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
[guidoh@localhost Work]$ 
k3yavi commented 5 years ago

Ah I think I know what's happening, retrospectively, it should have been more clear in the document but can you quickly check how many lines you have in exons.bed file ? https://github.com/COMBINE-lab/SalmonTools/blob/master/scripts/generateDecoyTranscriptome.sh#L97

I think I am assuming the gtf file to be unzipped which might be the issue here.

guidohooiveld commented 5 years ago

Ah I think I know what's happening, retrospectively, it should have been more clear in the document I think I am assuming the gtf file to be unzipped which might be the issue here.

Yes, you are correct! I indeed did use the compressed .gz files. After extracting these archives, the script runs fine; both with the GFF and GTF3 annotation files. :+1:

Thanks for your prompt assistance; much appreciated!

[guidoh@localhost Work]$ bash generateDecoyTranscriptome.sh \
>  -j 4 \
>  -g ./ENSEMBL/Lottia_gigantea.Lotgi1.dna.toplevel.fa \
>  -t ./ENSEMBL/Lottia_gigantea.Lotgi1.cdna.all.fa \
>  -a ./ENSEMBL/Lottia_gigantea.Lotgi1.44.gff3 \
>         -o gentrome.files
****************
*** getDecoy ***
****************
-j <Concurrency level> = 4
-g <Genome fasta> = /Work/ENSEMBL/Lottia_gigantea.Lotgi1.dna.toplevel.fa
-t <Transcriptome fasta> = /Work/ENSEMBL/Lottia_gigantea.Lotgi1.cdna.all.fa
-a <Annotation GTF file> = /Work/ENSEMBL/Lottia_gigantea.Lotgi1.44.gff3
-o <Output files Path> = gentrome.files
[1/10] Extracting exonic features from the gtf
[2/10] Masking the genome fasta
[3/10] Aligning transcriptome to genome
>>>>>>>>>>>>>>>>>>
Reference = [reference.masked.genome.fa]
Query = [/Work/ENSEMBL/Lottia_gigantea.Lotgi1.cdna.all.fa]
Kmer size = 16
Window size = 5
Segment length = 500 (read split allowed)
Alphabet = DNA
Percentage identity threshold = 80%
Mapping output file = mashmap.out
Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
Execution threads  = 4
>>>>>>>>>>>>>>>>>>
INFO, skch::Sketch::build, minimizers picked from reference = 90283507
INFO, skch::Sketch::index, unique minimizers = 55140979
INFO, skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 42768451) ... (82514, 1)
INFO, skch::Sketch::computeFreqHist, With threshold 0.001%, ignore minimizers occurring >= 1105 times during lookup.
INFO, skch::main, Time spent computing the reference index: 48.0952 sec
INFO, skch::Map::mapQuery, [count of mapped reads, reads qualified for mapping, total input reads] = [17706, 18589, 23349]
INFO, skch::main, Time spent mapping the query : 151.301 sec
INFO, skch::main, mapping results saved in : mashmap.out
[4/10] Extracting intervals from mashmap alignments
[5/10] Merging the intervals
[6/10] Extracting sequences from the genome
index file reference.masked.genome.fa.fai not found, generating...
[7/10] Concatenating to get decoy sequences
[8/10] Making gentrome
[9/10] Extracting decoy sequence ids
[10/10] Removing temporary files

**********************************************
*** DONE Processing ...
*** You can use files `$outfolder/gentrome.fa` 
*** and $outfolder/decoys.txt` with 
*** `salmon index`
**********************************************
k3yavi commented 5 years ago

Awesome !! Thanks for confirming and I'll update the doc or the script in a bit.

YBellaicheLab commented 4 years ago

Hi Avi, I ran into the same problem. Maybe you could update the description in the Salmon tools Read.me file? In addition, the realpath command is native to ubuntu so I had to source the script below that I found on stackoverflow to work around it. Could maybe be helpful for some of the other users.

Keep up the great work.

Eric

!/bin/sh

realpath() { OURPWD=$PWD cd "$(dirname "$1")" LINK=$(readlink "$(basename "$1")") while [ "$LINK" ]; do cd "$(dirname "$LINK")" LINK=$(readlink "$(basename "$1")") done REALPATH="$PWD/$(basename "$1")" cd "$OURPWD" echo "$REALPATH" }

erzakiev commented 1 year ago

Yes, indeed all files (gtf, genome and txome) have to be unarchived first in order for the script to not abort...

erzakiev commented 1 year ago

We have generated the decoys for some of the organisms here.

link's dead :(

rob-p commented 1 year ago

@k3yavi — any idea what happened to this link, or how we might relocate it?

k3yavi commented 1 year ago

I think this was associated with the Stony Brook account, and I lost access, unfortunately, after graduation.

k3yavi commented 1 year ago

@erzakiev If you face any issues running the scripts, let us know, we'd be happy to help.

erzakiev commented 1 year ago

@rob-p so kind of you to chime in, thank you! Just wanted to take the opportunity to personally express my utmost respect to the Salmon-Senpai for all your hard work.

@k3yavi thanks for the script (and also for your hard work in maintaining/improving salmon!! I see you are the second-highest contributor!) and also thanks for reaching out. Frankly, I tried running the script locally just now and mashmap crashed my machine after reaching 35Gb memory consumption; of note that I have 64Gb of ram but i do something in the background and can't afford to dedicate the whole machine for god knows how long to the sole purpose of generating the decoys... So I guess I won't try it again, maybe over the weekend.

But there must be someone who already generated these, right? I mean, lots of people use Salmon (hopefully, with decoys). The reference genomes, tx-omes and gtf files are also more or less standardized..

k3yavi commented 1 year ago

Thanks for your kind words.

Are you indexing the human genome? It can be accessed through refgenie http://refgenomes.databio.org/v3/genomes/splash/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4

erzakiev commented 1 year ago

Thanks for your kind words.

De rien, you are doing a great job!

Are you indexing the human genome? It can be accessed through refgenie http://refgenomes.databio.org/v3/genomes/splash/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4

Thank you for the link, I indeed work with Human genome. In generating this reference index, I wonder from which Ensembl release the input files were used? The latest?

I also work with mouse genomes GRCm39, MM129 and c57bl6, however. I snooped around on the website you linked to in search for these and it just crashes when I try to navigate to other panes:

Screenshot 2023-06-06 at 16 51 57
rob-p commented 1 year ago

I would also like to thank you for the kind words @erzakiev! It is salmon's users as much as (in fact, more than) our efforts who have made our software successful and better over time.

I'd also like to chime in here and suggest to @k3yavi that perhaps we can evaluate a newer version of mash map, or one of the newer mash map like tools, as they have improved in efficiency since our original paper and script. The funny thing is extracting the decoy sequences with mash map is the most memory intensive step. With the decoys in hand, indexing with salmon actually takes much less memory...

Best, Rob

k3yavi commented 1 year ago

I think the developers of refgenie would be better suited to provide the details about the exact versions of the references, although you can access mouse, rat, and human indices here

erzakiev commented 1 year ago

PS: I realize now that the script was probably never intended to be run on a local machine, correct? The memory consumption (see the screenshot below) and the necessity to specifying the paths to mashmap and bedtools explicitly now make sense. I'll try to run it on my communal cluster.

Screenshot 2023-06-06 at 17 09 13

On my local machine, even with the 150Gb memory used, it still is killed after a certain time:

****************
*** getDecoy ***
****************
-j <Concurrency level> = 12
-b <bedtools binary> = /opt/homebrew/Cellar/bedtools/2.30.0/bin/bedtools
-m <mashmap binary> = /Users/administrateur/miniconda3/envs/mashmap/bin/mashmap
-a <Annotation GTF file> = /Users/administrateur/Analyses/Alignment/Homo_sapiens.GRCh38.109.gtf
-g <Genome fasta> = /Users/administrateur/Analyses/Alignment/Homo_sapiens.GRCh38.dna.primary_assembly.fa
-t <Transcriptome fasta> = /Users/administrateur/Analyses/Alignment/Homo_sapiens.GRCh38.cdna.all.fa
-o <Output files Path> = GRCh38_index_SA
[1/10] Extracting exonic features from the gtf
[2/10] Masking the genome fasta
[3/10] Aligning transcriptome to genome
[mashmap] Reference = [reference.masked.genome.fa]
[mashmap] Query = [/Users/administrateur/Analyses/Alignment/Homo_sapiens.GRCh38.cdna.all.fa]
[mashmap] Kmer size = 19
[mashmap] Sketch size = 420
[mashmap] Segment length = 500 (read split allowed)
[mashmap] Block length min = 500
[mashmap] Chaining gap max = 500
[mashmap] Mappings per segment = 1
[mashmap] Percentage identity threshold = 80%
[mashmap] Do not skip self mappings
[mashmap] Hypergeometric filter w/ delta = 0 and confidence 0.999
[mashmap] Mapping output file = mashmap.out
[mashmap] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[mashmap] Execution threads  = 12
[mashmap::skch::Sketch::build] minmer windows picked from reference = 2767804093
./generateDecoyTranscriptome.sh: line 105:  2276 Killed: 9               $mashmap -r reference.masked.genome.fa -q $txpfile -t $threads --pi 80 -s 500

***************
*** ABORTED ***
***************

An error occurred. Exiting...