RobertsLab / resources

https://robertslab.github.io/resources/

Install transrate on Raven #1503

Closed mattgeorgephd closed 2 years ago

mattgeorgephd commented 2 years ago

https://hibberdlab.com/transrate/installation.html

kubu4 commented 2 years ago

Unfortunately, I think transrate might be dunzo: https://github.com/easybuilders/easybuild-easyconfigs/issues/12099#issuecomment-849931567

I'll continue to poke around a bit and see if I can build it from the source file, but I'm not totally sure...

sr320 commented 2 years ago

Based on our earlier exchange, this is likely not needed?

mattgeorgephd commented 2 years ago

131 samples from two tissues in Mytilus trossulus, which has no published genome. It is closely related to Mytilus edulis, M. galloprovincialis, and M. californianus.

I'm getting ~35% alignment with the Mytilus edulis genome and ~38% alignment with the Mytilus galloprovincialis genome with hisat2.

The edulis genome assembly is at the chromosome level, but is not annotated. The gallo assembly is at the scaffold level and has a genomic GFF.

There is another scaffold-level assembly for edulis that is annotated, as well as one for a more distantly related Chinese Mytilus species.

A generalized analysis would be great, but I'm really interested in the expression of a number of key foot proteins. The sequences of a couple of the proteins have been described and are similar across species.

@kubu4 @sr320 Any suggestions on the best pathway forward?

sr320 commented 2 years ago

I think it is worth making your own trossulus transcriptome with SRA data, e.g. https://d.pr/i/F7uYrH

kubu4 commented 2 years ago

@mattgeorgephd Do you have library prep info, trimming params, and FastQC info/data we can glance at?

mattgeorgephd commented 2 years ago

I can find info on library prep. Here is the link to job JA22078

MultiQC reports: untrimmed, trimmed and merged

Code used to trim and merge below:

trim adapter sequences

mkdir trim-fastq/
cd raw-data

for F in *.fastq
do
# strip the directory structure from each file name to create
# the output name for each file (the .fastq suffix is kept)
results_file="$(basename -a "$F")"

# run cutadapt on each file: trim poly-A, poly-G, and adapter sequence from
# the 3' end (-a), remove the first 15 bases of each read (-u 15), and
# discard reads shorter than 20 bases (-m 20)
/home/shared/8TB_HDD_02/mattgeorgephd/.local/bin/cutadapt "$F" -a "A{8}" -a "G{8}" -a AGATCGG -u 15 -m 20 -o \
"/home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot/trim-fastq/$results_file"
done

merge lanes after trimming

mkdir merged-fastq
cd trim-fastq/

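# reduce each file name to its first two underscore-separated fields,
# then concatenate all R1 lane files that share that prefix
printf '%s\n' *.fastq | sed 's/^\([^_]*_[^_]*\).*/\1/' | uniq |
while read -r prefix; do
    cat "$prefix"*R1*.fastq >"${prefix}_R1.fastq"
done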

# I moved files to merged-fastq
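
One caveat with the merge loop above: the merged file name (`${prefix}_R1.fastq`) also matches the input glob `"$prefix"*R1*.fastq`, so a second run in the same directory would try to feed earlier merged output back into `cat`. Below is a minimal sketch of the same merge that writes into a separate directory instead. The `merge_lanes` function name and its two directory arguments are hypothetical, and the sketch assumes file names of the form `<sample>_<S-number>_L00N_R1_001.fastq`, as in the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge per-lane R1 FastQ files into one file per sample.
# Usage: merge_lanes <input_dir> <output_dir>
merge_lanes() {
    local in_dir="$1" out_dir="$2" prefix
    mkdir -p "$out_dir"
    # Reduce each file name to its first two underscore-separated fields
    # (e.g. T001F_S100), then merge every lane file that shares that prefix.
    ( cd "$in_dir" && printf '%s\n' *_R1_*.fastq ) |
        sed 's/^\([^_]*_[^_]*\).*/\1/' | sort -u |
        while read -r prefix; do
            # Output goes to a different directory, so it can never match
            # the input glob on a later run.
            cat "$in_dir/$prefix"_*R1*.fastq > "$out_dir/${prefix}_R1.fastq"
        done
}

# Example call (directory names taken from the commands above):
#   merge_lanes trim-fastq merged-fastq
```

Writing straight into `merged-fastq` would also remove the need to move the files afterwards.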

kubu4 commented 2 years ago

Thanks! Things look good with that stuff, so I'd continue to go forward with @sr320's suggestion.

sr320 commented 2 years ago

pretty sure something like the following works on raven to get SRAs down...

/home/shared/sratoolkit.2.11.2-ubuntu64/bin/fasterq-dump --split-files SRR322877 -O /home/shared/8TB_HDD_01/sr320/ncbi/ -e 20

kubu4 commented 2 years ago

I've installed transrate on Raven.

It has not been tested with any data. To run it, type the following at a command prompt and press Enter:

transrate

mattgeorgephd commented 2 years ago

@kubu4 transrate works at the terminal but not in bash chunks. I get this error:

"tmp/RtmplzSQQ7/chunk-code-8c68537fedc5.txt: line 1: transrate: command not found"

kubu4 commented 2 years ago

This is probably related to the default environment that R loads, which differs from the default environment that bash loads when you log in to a terminal. Because the environments differ, the programs available through the system $PATH differ, and it looks like the environment loaded by R doesn't have transrate in its $PATH.

Try specifying the full path to transrate (found by running this in a terminal: which transrate) in your bash chunk and see if that works:

/usr/share/rvm/gems/ruby-3.0.0/bin/transrate

Alternatively, in the bash chunk, try running this before the transrate command (in the same chunk):

rvm use ruby-3.0.0

That loads the Ruby version manager and tells it to use ruby-3.0.0, which was used to install transrate. I think this will make transrate available in the system $PATH, but only in that single bash chunk.

mattgeorgephd commented 2 years ago

After running this:

/usr/share/rvm/bin/rvm use ruby-3.0.0

/usr/share/rvm/gems/ruby-3.0.0/bin/transrate --reference sequences/SRR6051615-mytilus_trossulus_transcriptome.fasta \
          --left trim-fastq/T001F_S100_L001_R1_001.fastq \
          --right trim-fastq/T001F_S100_L002_R1_001.fastq \
          --threads 48
          --output transrate_output/  

I'm getting this error:

RVM is not a function, selecting rubies with 'rvm use ...' will not work.

You need to change your terminal emulator preferences to allow login shell.
Sometimes it is required to use `/bin/bash --login` as the command.
Please visit https://rvm.io/integration/gnome-terminal/ for an example.

/usr/share/rvm/rubies/ruby-3.0.0/lib/ruby/3.0.0/rubygems.rb:281:in `find_spec_for_exe': can't find gem transrate (>= 0.a) with executable transrate (Gem::GemNotFoundException)
    from /usr/share/rvm/rubies/ruby-3.0.0/lib/ruby/3.0.0/rubygems.rb:300:in `activate_bin_path'
    from /usr/share/rvm/gems/ruby-3.0.0/bin/transrate:23:in `<main>'
/tmp/Rtmp5QQkuY/chunk-code-50e9682fe615.txt: line 17: --output: command not found
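
The final line of that output (`--output: command not found`) suggests the shell treated `--output transrate_output/` as its own command, which is what happens when the preceding line (`--threads 48`) has no trailing backslash. A minimal sketch of the effect, using a hypothetical `show_args` function as a stand-in for any command-line tool:

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a real program: prints how many arguments it got.
show_args() { echo "received $# arguments"; }

# Every continued line ends with a backslash, so this is ONE command
# with four arguments.
show_args --threads 48 \
          --output transrate_output/

# Without the backslash after "48" the command would end there, and the
# shell would then try to run "--output" as a separate command:
#
#   show_args --threads 48
#             --output transrate_output/   # "--output: command not found"
```

This only accounts for the last line of the error output. The RVM and `Gem::GemNotFoundException` messages are a separate problem.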

kubu4 commented 2 years ago

Did you try my first suggestion of just specifying the full path to transrate (with the rvm line)?

If yes, what happens with that?

kubu4 commented 2 years ago

After playing with this for a bit in RStudio on Raven, it's going to be way too difficult/time consuming to figure this out. You'll have to run transrate outside of RStudio.

sr320 commented 2 years ago

Again, I would ask why you are attempting to run transrate...

mattgeorgephd commented 2 years ago

@sr320 Since Sam went through the trouble of getting it on Raven, I was hoping to get it to work to evaluate my Trinity assembly :)

While we are on the topic - I think I'm running into a memory issue with trinity using the trossulus SRA:

After running this:

export PATH=/home/shared/samtools-1.12:$PATH
export PATH=/home/shared/jellyfish-2.3.0/bin:$PATH
export PATH=/home/shared/bowtie2-2.4.4-linux-x86_64:$PATH
export PATH=/home/shared/salmon-1.4.0_linux_x86_64/bin:$PATH

/home/shared/trinityrnaseq-v2.12.0/Trinity --seqType fa \
--single sequences/SRR6051615-mytilus_trossulus_transcriptome.fasta \
--max_memory 200G \
--min_kmer_cov 2 \
--CPU 48 \
--output home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot/trinity_out/ 

I'm getting this:

     ______  ____   ____  ____   ____  ______  __ __
    |      ||    \ |    ||    \ |    ||      ||  |  |
    |      ||  D  ) |  | |  _  | |  | |      ||  |  |
    |_|  |_||    /  |  | |  |  | |  | |_|  |_||  ~  |
      |  |  |    \  |  | |  |  | |  |   |  |  |___, |
      |  |  |  .  \ |  | |  |  | |  |   |  |  |     |
      |__|  |__|\_||____||__|__||____|  |__|  |____/

    Trinity-v2.12.0

Single read files: $VAR1 = [
          'sequences/SRR6051615-mytilus_trossulus_transcriptome.fasta'
        ];
Trinity version: Trinity-v2.12.0
** NOTE: Latest version of Trinity is Trinity-v2.14.0, and can be obtained at:
    https://github.com/trinityrnaseq/trinityrnaseq/releases

Thursday, August 11, 2022: 11:08:36 CMD: java -Xmx64m -XX:ParallelGCThreads=2  -jar /home/shared/trinityrnaseq-v2.12.0/util/support_scripts/ExitTester.jar 0
Thursday, August 11, 2022: 11:08:36 CMD: java -Xmx4g -XX:ParallelGCThreads=2  -jar /home/shared/trinityrnaseq-v2.12.0/util/support_scripts/ExitTester.jar 1
Thursday, August 11, 2022: 11:08:36 CMD: mkdir -p /home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot/home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot/trinity_out
Thursday, August 11, 2022: 11:08:36 CMD: mkdir -p /home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot/home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot/trinity_out/chrysalis

----------------------------------------------------------------------------------
-------------- Trinity Phase 1: Clustering of RNA-Seq Reads  ---------------------
----------------------------------------------------------------------------------

---------------------------------------------------------------
------------ In silico Read Normalization ---------------------
-- (Removing Excess Reads Beyond 200 Coverage --
---------------------------------------------------------------

# running normalization on reads: $VAR1 = [
          [
            '/home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot/sequences/SRR6051615-mytilus_trossulus_transcriptome.fasta'
          ]
        ];

Thursday, August 11, 2022: 11:08:36 CMD: /home/shared/trinityrnaseq-v2.12.0/util/insilico_read_normalization.pl --seqType fa --JM 200G  --max_cov 200 --min_cov 2 --CPU 48 --output /home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot/home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot/trinity_out/insilico_read_normalization --max_CV 10000  --single /home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot/sequences/SRR6051615-mytilus_trossulus_transcriptome.fasta
-prepping seqs
CMD: cat /home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot/sequences/SRR6051615-mytilus_trossulus_transcriptome.fasta >> single.fa
CMD finished (5 seconds)
CMD: touch single.fa.ok
CMD finished (0 seconds)
-kmer counting.
-------------------------------------------
----------- Jellyfish  --------------------
-- (building a k-mer catalog from reads) --
-------------------------------------------

CMD: jellyfish count -t 48 -m 25 -s 100000000  --canonical  single.fa
CMD finished (22 seconds)
CMD: jellyfish histo -t 48 -o jellyfish.K25.min2.kmers.fa.histo mer_counts.jf
CMD finished (5 seconds)
CMD: jellyfish dump -L 2 mer_counts.jf > jellyfish.K25.min2.kmers.fa
CMD finished (14 seconds)
CMD: touch jellyfish.K25.min2.kmers.fa.success
CMD finished (0 seconds)
-generating stats files
CMD: /home/shared/trinityrnaseq-v2.12.0/util/..//Inchworm/bin/fastaToKmerCoverageStats --reads single.fa --kmers jellyfish.K25.min2.kmers.fa --kmer_size 25  --num_threads 48  --DS  > single.fa.K25.stats
-reading Kmer occurrences...

 done parsing 61143520 Kmers, 61143520 added, taking 57 seconds.
STATS_GENERATION_TIME: 0 seconds.
CMD finished (71 seconds)
CMD: touch single.fa.K25.stats.ok
CMD finished (0 seconds)
-sorting each stats file by read name.
CMD: head -n1 single.fa.K25.stats > single.fa.K25.stats.sort && tail -n +2 single.fa.K25.stats | /usr/bin/sort --parallel=48 -k1,1 -T . -S 200G >> single.fa.K25.stats.sort
CMD finished (0 seconds)
CMD: touch single.fa.K25.stats.sort.ok
CMD finished (0 seconds)
-defining normalized reads
CMD: /home/shared/trinityrnaseq-v2.12.0/util/..//util/support_scripts//nbkc_normalize.pl --stats_file single.fa.K25.stats.sort  --max_cov 200  --min_cov 2  --max_CV 10000 > single.fa.K25.stats.sort.maxC200.minC2.maxCV10000.accs
Error, no reads made it to the normalization process...   at /home/shared/trinityrnaseq-v2.12.0/util/..//util/support_scripts//nbkc_normalize.pl line 119.
Thread 1 terminated abnormally: Error, cmd: /home/shared/trinityrnaseq-v2.12.0/util/..//util/support_scripts//nbkc_normalize.pl --stats_file single.fa.K25.stats.sort  --max_cov 200  --min_cov 2  --max_CV 10000 > single.fa.K25.stats.sort.maxC200.minC2.maxCV10000.accs died with ret 65280 at /home/shared/trinityrnaseq-v2.12.0/util/insilico_read_normalization.pl line 793.
Error, thread exited with error Error, cmd: /home/shared/trinityrnaseq-v2.12.0/util/..//util/support_scripts//nbkc_normalize.pl --stats_file single.fa.K25.stats.sort  --max_cov 200  --min_cov 2  --max_CV 10000 > single.fa.K25.stats.sort.maxC200.minC2.maxCV10000.accs died with ret 65280 at /home/shared/trinityrnaseq-v2.12.0/util/insilico_read_normalization.pl line 793.

Error, 1 threads errored out at /home/shared/trinityrnaseq-v2.12.0/util/insilico_read_normalization.pl line 997.
Error, cmd: /home/shared/trinityrnaseq-v2.12.0/util/insilico_read_normalization.pl --seqType fa --JM 200G  --max_cov 200 --min_cov 2 --CPU 48 --output /home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot/home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot/trinity_out/insilico_read_normalization --max_CV 10000  --single /home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot/sequences/SRR6051615-mytilus_trossulus_transcriptome.fasta died with ret 7424 at /home/shared/trinityrnaseq-v2.12.0/Trinity line 2869.
    main::process_cmd("/home/shared/trinityrnaseq-v2.12.0/util/insilico_read_normali"...) called at /home/shared/trinityrnaseq-v2.12.0/Trinity line 3422
    main::normalize("/home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pi"..., 200, ARRAY(0x5580aa8a5eb0)) called at /home/shared/trinityrnaseq-v2.12.0/Trinity line 3362
    main::run_normalization(200, ARRAY(0x5580aa8a5eb0)) called at /home/shared/trinityrnaseq-v2.12.0/Trinity line 1389
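
Separately from the normalization error, the log above shows Trinity creating its output directory under `.../PSMFC-mytilus-byssus-pilot/home/shared/8TB_HDD_02/...`. That doubling is consistent with the `--output` value lacking a leading `/`, so it was resolved relative to the working directory. A minimal sketch of that resolution rule follows. The `resolve_path` helper is hypothetical and only mimics how a relative path is interpreted. It is not part of Trinity.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show where a path ends up when used from a given
# working directory. Absolute paths are kept; relative ones are appended.
# Usage: resolve_path <working_dir> <path>
resolve_path() {
    local cwd="$1" path="$2"
    case "$path" in
        /*) printf '%s\n' "$path" ;;            # absolute: used as-is
        *)  printf '%s/%s\n' "$cwd" "$path" ;;  # relative: joined to cwd
    esac
}

project=/home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot

# No leading slash, as in the Trinity command above: the path is doubled.
resolve_path "$project" "home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot/trinity_out"

# With a leading slash the path is taken literally.
resolve_path "$project" "/home/shared/8TB_HDD_02/mattgeorgephd/PSMFC-mytilus-byssus-pilot/trinity_out"
```

This explains only the nested output directory. It is not offered as the cause of the "no reads made it to the normalization process" error.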

kubu4 commented 2 years ago

"While we are on the topic - I think I'm running into a memory issue with trinity using the trossulus SRA:"

Please create a separate issue to address this.

As a teaser, and to provide motivation to get that new issue started: I know why that Trinity assembly failed...