COMBINE-lab / SalmonTools

Useful tools for working with Salmon output
BSD 3-Clause "New" or "Revised" License

Segmentation fault on MashMap step of generateDecoyTranscriptome.sh #5

Open jaclyn-taroni opened 5 years ago

jaclyn-taroni commented 5 years ago

Hi all,

I get Segmentation fault (core dumped) on step 3 of generateDecoyTranscriptome.sh.

I've filed https://github.com/marbl/MashMap/issues/21 upstream with more detailed information. I wanted to file an issue here in case you have any insight or I am using the script improperly.

Here's how I'm using this:

bash scripts/generateDecoyTranscriptome.sh \
    -j 8 \
    -g Homo_sapiens.GRCh38.dna.toplevel.fa \
    -t Homo_sapiens.GRCh38.cdna.all.fa \
    -a Homo_sapiens.GRCh38.96.gtf \
    -o ${human_output}

I realize you have gentrome.fa and decoys.txt for human here: https://github.com/COMBINE-lab/salmon#pre-computed-decoy-transcriptomes

I'm interested in generating this for zebrafish; I happened to run into this problem with human first, before I found those files on the Salmon README.

Thank you!

k3yavi commented 5 years ago

Hi @jaclyn-taroni ,

Thanks for raising this issue; one other user is facing a similar issue with the human genome. While the MashMap folks and we look for the cause and a solution, if you can send me links to the zebrafish genome and GTF, I can run it on our system and send you the decoy sequences.

jaclyn-taroni commented 5 years ago

Hi @k3yavi,

Thanks for the quick reply and the offer. I was planning on using the most recent Ensembl release for zebrafish. Here are the relevant links:

ftp://ftp.ensembl.org/pub/release-96/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.toplevel.fa.gz
ftp://ftp.ensembl.org/pub/release-96/gtf/danio_rerio/Danio_rerio.GRCz11.96.gtf.gz

Thanks again!

rob-p commented 5 years ago

Hi @jaclyn-taroni,

@k3yavi has built the decoy transcriptome for zebrafish; you can grab it from the link on the salmon README.

--Rob

jaclyn-taroni commented 5 years ago

Thank you very much @k3yavi and @rob-p!

cmatKhan commented 5 years ago

hi @k3yavi

I'm getting the same error with data from a tick species -- any chance you'd be willing to run this for me, too?

The genome is (we use the first one, Ixodes-Scapularis-ISE6_...): https://www.vectorbase.org/downloads?field_organism_taxonomy_tid%5B%5D=340&field_download_file_type_tid%5B%5D=457&field_download_file_format_tid=All&field_status_value=Current

The .gtf (ISE6, same as above): https://www.vectorbase.org/downloads?field_organism_taxonomy_tid%5B%5D=340&field_download_file_type_tid%5B%5D=412&field_download_file_format_tid=473&field_status_value=Current

And a transcriptome that has not yet been published or posted -- I'd have to send it.

k3yavi commented 5 years ago

Hi @cmatKhan , Ixodes_scapularis.tar.gz should do it.

choulabucsf commented 5 years ago

Very much appreciated.

I realized after I hit send that there is a transcriptome on vectorbase -- I assume that's what you used?

k3yavi commented 5 years ago

Actually, I just used the GTF and the genome to extract the transcriptome.

k3yavi commented 5 years ago

Hi Guys,

Just a heads-up: we have curated decoy sequences for a subset of model organisms, and they can be found here.

doubtfulresearch commented 5 years ago

I'm having this issue as well. I've tried it on a couple of machines, although the most RAM any of them has is 24 GB (20 GB free).

Any chance you could generate decoys for RefSeq human and mouse? They provide GFF annotation files, which I was feeding directly into step 2 (instead of the exons.bed). Step 2 completes fine, but step 3 fails pretty early with a segmentation fault.

Alternatively, can you give an estimate of how much RAM this script uses on the machine where it completes successfully? Also, how long does it typically take? I've not used MashMap before. I tried a trial run with a smaller genome and 10 threads, and while it didn't hit a segmentation fault, I gave up after ~6 hours in step 3: I didn't really need the decoys, but I was surprised at how long it was taking.

Thanks!

k3yavi commented 5 years ago

Hi, please fill out the decoy generation request form at https://forms.gle/3baJc5SYrkSWb1z48 and we will let you know once we have the decoys.

On our machine, it took ~100 GB of RAM and approximately an hour to run on the human GENCODE data.

Thanks!
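For anyone who wants to check a machine against that figure before starting a long run, here is a minimal sketch of a pre-flight memory check. It assumes Linux (it reads `MemAvailable` from `/proc/meminfo`), and the function names `mem_available_gb` and `has_enough_mem` are hypothetical helpers, not part of SalmonTools. The ~100 GB threshold is only the number reported above for human GENCODE data, and the thread does not confirm that the segmentation fault is caused by running out of memory.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of SalmonTools): check free RAM before
# launching generateDecoyTranscriptome.sh. Linux only.

# Print available memory in whole GB, read from a meminfo-style file.
# Defaults to /proc/meminfo, where values are reported in kB.
mem_available_gb() {
    local meminfo="${1:-/proc/meminfo}"
    awk '/^MemAvailable:/ { print int($2 / 1024 / 1024) }' "$meminfo"
}

# Succeed only if at least $1 GB is available.
# $2 optionally overrides the meminfo path (useful for testing).
has_enough_mem() {
    local required_gb="$1" meminfo="${2:-/proc/meminfo}"
    local available_gb
    available_gb="$(mem_available_gb "$meminfo")"
    [ -n "$available_gb" ] && [ "$available_gb" -ge "$required_gb" ]
}

# Example (threshold is an assumption taken from the figure above):
#   has_enough_mem 100 || { echo "Less than 100 GB free; MashMap step may fail" >&2; exit 1; }
```

The threshold is passed as an argument because smaller genomes may well need less.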

k3yavi commented 4 years ago

Hi guys,

Just wanted to let you know that we recently released a new version of salmon in which you don't have to explicitly run the MashMap pipeline. With v1.0, salmon can consume both the genome and the transcriptome without the need for annotations. Please check out the new preprint or follow this tutorial for reindexing.
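As a rough sketch of what that looks like in practice: the decoy list is simply the genome's sequence names, and the index is built from the transcriptome concatenated with the genome. The helper names `make_decoy_list` and `make_gentrome` and all file names below are assumptions for illustration, and the sketch assumes uncompressed FASTA with space-delimited headers (as in Ensembl files). The linked tutorial remains the authoritative reference.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for preparing a decoy-aware salmon index (salmon >= 1.0).

# Write one decoy name per line: the first space-delimited token of each
# FASTA header in the genome, with the leading ">" removed.
make_decoy_list() {
    local genome_fa="$1" decoys_txt="$2"
    grep '^>' "$genome_fa" | cut -d ' ' -f 1 | sed 's/^>//' > "$decoys_txt"
}

# Concatenate the transcriptome first and the genome second, so the
# decoy sequences come after all transcripts in the combined file.
make_gentrome() {
    local transcripts_fa="$1" genome_fa="$2" gentrome_fa="$3"
    cat "$transcripts_fa" "$genome_fa" > "$gentrome_fa"
}

# Example invocation (file names and thread count are placeholders):
#   make_decoy_list genome.fa decoys.txt
#   make_gentrome transcripts.fa genome.fa gentrome.fa
#   salmon index -t gentrome.fa -d decoys.txt -i salmon_index -p 8
```

Putting the transcripts first matters because the decoys are expected at the end of the combined reference.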

lpantano commented 4 years ago

Thank you so much! I asked in the chat, but just in case: do you have an estimate of memory usage during indexing and quantification, assuming a human-genome-like reference? Thanks!

rob-p commented 4 years ago

Hi @lpantano,

Indexing with the entire human genome as the decoy and the whole transcriptome (GENCODE v29) as the actual target sequence takes ~20G of RAM in our runs. The final (dense) index size is ~19G, so construction RAM is only a little more. Interestingly, while the final index built with the whole genome as the decoy is considerably bigger than one built with the MashMap decoy sequences, the indexing memory is quite a bit smaller.