RobertsLab / resources

https://robertslab.github.io/resources/

genome-based transcriptomics workflow #1738

Closed grace-ac closed 9 months ago

grace-ac commented 1 year ago

I've never done transcriptomics where there's been a reference genome! And I realized I haven't done it yet... and should for WSN... so anyway:

there's a published genome for Pycnopodia helianthoides: https://datadryad.org/stash/dataset/doi:10.5061/dryad.51c59zwfd

Based on the handbook's gene expression workflow, it looks like the steps are:

  1. QC and trimming (I've already done this)
  2. Alignment of sequences to genome using HISAT2
  3. DESeq2 for diff expression

I've never used HISAT2 before, but it looks like there's some use cases in the handbook.

Is there anything else I'm missing? I'm pretty sure I don't use kallisto since I don't need to pseudo-align things?
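
For reference, steps 2-3 above usually look something like this at the command line (a minimal sketch: the file names, index name, and thread counts are placeholders, not the actual project files, and featureCounts is just one common way to get a count matrix for DESeq2):

```shell
# 1. Build a HISAT2 index from the genome FASTA (one-time step)
hisat2-build -p 8 genome.fna genome_index

# 2. Align trimmed paired-end reads; pipe to samtools for a sorted BAM
hisat2 -p 8 --dta -x genome_index \
  -1 sample_R1.trim.fq.gz -2 sample_R2.trim.fq.gz \
  | samtools sort -@ 8 -o sample.sorted.bam

# 3. Summarize to gene-level counts for DESeq2 (GFF attribute names vary;
#    -t/-g here assume a GFF3 with gene features carrying an ID attribute)
featureCounts -p -T 8 -a genome.gff -t gene -g ID -o counts.txt sample.sorted.bam
```

The resulting counts.txt can then be loaded into R as the count matrix for DESeq2.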

sr320 commented 1 year ago

Yep - you got it! HISAT2 is the way to go.

AHuffmyer commented 1 year ago

Yes exactly! Here is an example script I used for mapping TagSeq reads to a genome in corals (note the script will be a bit different for RNAseq): https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/blob/master/Mcap2020/Scripts/TagSeq/Genome_V3/TagSeq_BioInf_genomeV3.md

grace-ac commented 1 year ago

Sweet!! Thanks so much! I'll take a look and will post issues if I get stuck.

kubu4 commented 1 year ago

> a published genome

I don't see the actual genome (i.e., a FASTA) on that page - just various annotations...

grace-ac commented 1 year ago

Yes, that is exactly what I'm trying to find right now, ha... I might not be looking in the correct place.

grace-ac commented 1 year ago

https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=7614

grace-ac commented 1 year ago

I ran this on the command line on Mox to download the genome .zip:

curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCA_032158295.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCA_032158295.1.zip" -H "Accept: application/zip"

the code was from ncbi genbank page: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_032158295.1/

I now have the .zip file in a directory on mox:
/gscratch/srlab/graceac9/data/pycno/genome

I need to unzip the file. I tried:

gunzip GCA_032158295.1.zip.gz

and it says:

gzip: GCA_032158295.1.zip: unknown suffix -- ignored
sr320 commented 1 year ago

URL to code? (Rmd or Quarto file)

kubu4 commented 1 year ago

It's not a gzip file (not sure where that filename came from), it's a zip file. So, you should run:

unzip GCA_032158295.1.zip
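
For future reference, you can check what a downloaded archive actually is before picking a decompressor - `file` inspects the magic bytes rather than trusting the extension (the filename below is just the one from this thread):

```shell
file GCA_032158295.1.zip       # a real NCBI datasets download reports: Zip archive data
unzip -l GCA_032158295.1.zip   # list contents without extracting
# for genuinely gzipped files, test integrity with:
# gzip -t somefile.gz
```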
grace-ac commented 1 year ago

ran

unzip GCA_032158295.1.zip

error:

 End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of GCA_032158295.1.zip or
        GCA_032158295.1.zip.zip, and cannot find GCA_032158295.1.zip.ZIP, period.
sr320 commented 1 year ago

URL to code?


grace-ac commented 1 year ago

Does curl not work with Mox? @yaaminiv has assisted me greatly and we figured out I could run the curl code on my computer, unzip there, and now I will rsync to Mox.

sr320 commented 1 year ago

URL to code?


sr320 commented 1 year ago

URL to code?

My bad - thought this was Raven...

Re: curl and Mox - you need to be on a login node... https://robertslab.github.io/resources/mox_Node-Types/

But rsync also works just fine.

kubu4 commented 1 year ago

@grace-ac - You're right - this command fails on Mox (I just tested on login and build nodes).

But, you didn't mention the error in your post: curl: (23) Failed writing body (1692 != 8186)

I've emailed UW IT, but glad you figured out the workaround.

grace-ac commented 1 year ago

OK, things have been figured out by @yaaminiv!

Ran the curl code from this comment: https://github.com/RobertsLab/resources/issues/1738#issuecomment-1778215854 - it curled the .zip into the Downloads folder on my laptop.

Then I rsynced the .zip file from Downloads to my Mox user directory.

Now I'm going to work on the HISAT2 script to align the summer 2021 sea star QC'ed and trimmed RNAseq reads to the pycno genome.

grace-ac commented 1 year ago

I have the .sh for HISAT2 queued up on Mox (02-hisat2-summer2021.sh).

the queue says:

4963159     srlab gac_2021 graceac9 PD       0:00      1 (Resources)

I don't remember seeing (Resources) before. Is that typical? Is there something I need to change?

kubu4 commented 1 year ago

What does your script look like? More specifically, which node are you trying to use (i.e. 500GB or 120GB)? Someone is currently using one of the two nodes, so I'm guessing they're already using the 500GB node. You can do one of the following:

  1. Wait for the 500GB node to free up.

OR

  2. Cancel the job and resubmit it to the 120GB node.
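
To see why a job is pending and what the nodes are doing, something like this works (the partition name srlab and the job ID are taken from the queue output above; flags are standard Slurm):

```shell
squeue -p srlab                    # all jobs on the partition; PD = pending
squeue -j 4963159 -O reason        # why this specific job is pending
sinfo -p srlab -N -o "%N %T %m"    # node names, state, and memory per node
```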

grace-ac commented 1 year ago

Here's my script: 02-hisat2-summer2021.sh

And yes, I'm using the 500GB node, so I'll cancel this job and change to the 120GB node.

grace-ac commented 1 year ago

Link to my script as it currently stands: code/02-hisat2-summer2021.sh

It ran for ~6 minutes, and then I got a Slurm error message that is very long... Mox path to slurm-4963471.out:

/gscratch/srlab/graceac9/analyses/20231025-hisat2

message:

Settings:
  Output files: "/gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/Phelianthoides_ref.*.ht2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  /gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna
Reading reference sizes
  Time reading reference sizes: 00:00:04
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:02
  Time to read SNPs and splice sites: 00:00:00
Using parameters --bmax 90715621 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 90715621 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  Building sPrimeOrder
  V-Sorting samples
  V-Sorting samples time: 00:00:09
  Allocating rank array
  Ranking v-sort output
  Ranking v-sort output time: 00:00:02
  Invoking Larsson-Sadakane on ranks
  Invoking Larsson-Sadakane on ranks time: 00:00:03
  Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Split 1, merged 7; iterating...
Splitting and merging
  Splitting and merging time: 00:00:00
Split 1, merged 1; iterating...
Splitting and merging
  Splitting and merging time: 00:00:00
Split 1, merged 0; iterating...
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 6.04771e+07 (target: 90715620)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering GFM loop
Getting block 1 of 8
  Reserving size (90715621) for bucket 1
  Calculating Z arrays for bucket 1
  Entering block accumulator loop for bucket 1:
  bucket 1: 10%
  bucket 1: 20%
  bucket 1: 30%
  bucket 1: 40%
  bucket 1: 50%
  bucket 1: 60%
  bucket 1: 70%
  bucket 1: 80%
  bucket 1: 90%
  bucket 1: 100%
  Sorting block of length 36899782 for bucket 1
  (Using difference cover)
  Sorting block time: 00:00:11
Returning block of 36899783 for bucket 1
Getting block 2 of 8
  Reserving size (90715621) for bucket 2
  Calculating Z arrays for bucket 2
  Entering block accumulator loop for bucket 2:
  bucket 2: 10%
  bucket 2: 20%
  bucket 2: 30%
  bucket 2: 40%
  bucket 2: 50%
  bucket 2: 60%
  bucket 2: 70%
  bucket 2: 80%
  bucket 2: 90%
  bucket 2: 100%
  Sorting block of length 73139168 for bucket 2
  (Using difference cover)
  Sorting block time: 00:00:22
Returning block of 73139169 for bucket 2
Getting block 3 of 8
  Reserving size (90715621) for bucket 3
  Calculating Z arrays for bucket 3
  Entering block accumulator loop for bucket 3:
  bucket 3: 10%
  bucket 3: 20%
  bucket 3: 30%
  bucket 3: 40%
  bucket 3: 50%
  bucket 3: 60%
  bucket 3: 70%
  bucket 3: 80%
  bucket 3: 90%
  bucket 3: 100%
  Sorting block of length 36948978 for bucket 3
  (Using difference cover)
  Sorting block time: 00:00:11
Returning block of 36948979 for bucket 3
Getting block 4 of 8
  Reserving size (90715621) for bucket 4
  Calculating Z arrays for bucket 4
  Entering block accumulator loop for bucket 4:
  bucket 4: 10%
  bucket 4: 20%
  bucket 4: 30%
  bucket 4: 40%
  bucket 4: 50%
  bucket 4: 60%
  bucket 4: 70%
  bucket 4: 80%
  bucket 4: 90%
kubu4 commented 1 year ago

Is that the full Slurm contents?

If not, please paste the last 25 lines or so.

grace-ac commented 1 year ago

That is the full Slurm contents.

kubu4 commented 1 year ago

Wait. Did your email from Mox say the job completed or failed?

grace-ac commented 1 year ago

email said failed!

kubu4 commented 1 year ago

Thanks.

What was the error code?

kubu4 commented 1 year ago

Also, that script linked above specifies 500G of memory, but you changed it, right? Can you please share the current script?

grace-ac commented 1 year ago

Slurm Job_id=4963471 Name=gac_2021_hisat2 Failed, Run time 00:06:07, FAILED, ExitCode 1

kubu4 commented 1 year ago

Thanks.

grace-ac commented 1 year ago

oh! thanks for catching that- here's current script: 02-hisat2-summer2021.sh

kubu4 commented 1 year ago

Don't know if it actually makes a difference or not, but there shouldn't be a B at the end of the memory specification. It should just be 120G.

Try making that change and re-run the script.
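
For reference, the relevant SBATCH header lines would look something like this (a sketch - the values are illustrative, not the actual script's; Slurm expects single-letter size suffixes like K/M/G/T, so `120G` rather than `120GB`):

```shell
#SBATCH --job-name=gac_2021_hisat2
#SBATCH --nodes=1
#SBATCH --time=2-00:00:00
#SBATCH --mem=120G          # memory per node; "120GB" is not a standard suffix
#SBATCH --partition=srlab
```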

grace-ac commented 1 year ago

OK! Just changed it and am re-running now. Thanks! We'll see what happens.

grace-ac commented 1 year ago

email says failed: Slurm Job_id=4963510 Name=gac_2021_hisat2 Failed, Run time 00:06:27, FAILED, ExitCode 1

full slurm output:

Settings:
  Output files: "/gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/Phelianthoides_ref.*.ht2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  /gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna
Reading reference sizes
  Time reading reference sizes: 00:00:04
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:03
  Time to read SNPs and splice sites: 00:00:00
Using parameters --bmax 90715621 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 90715621 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  Building sPrimeOrder
  V-Sorting samples
  V-Sorting samples time: 00:00:09
  Allocating rank array
  Ranking v-sort output
  Ranking v-sort output time: 00:00:02
  Invoking Larsson-Sadakane on ranks
  Invoking Larsson-Sadakane on ranks time: 00:00:03
  Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Split 1, merged 7; iterating...
Splitting and merging
  Splitting and merging time: 00:00:00
Split 1, merged 1; iterating...
Splitting and merging
  Splitting and merging time: 00:00:00
Split 1, merged 0; iterating...
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 6.04771e+07 (target: 90715620)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering GFM loop
Getting block 1 of 8
  Reserving size (90715621) for bucket 1
  Calculating Z arrays for bucket 1
  Entering block accumulator loop for bucket 1:
  bucket 1: 10%
  bucket 1: 20%
  bucket 1: 30%
  bucket 1: 40%
  bucket 1: 50%
  bucket 1: 60%
  bucket 1: 70%
  bucket 1: 80%
  bucket 1: 90%
  bucket 1: 100%
  Sorting block of length 36899782 for bucket 1
  (Using difference cover)
  Sorting block time: 00:00:11
Returning block of 36899783 for bucket 1
Getting block 2 of 8
  Reserving size (90715621) for bucket 2
  Calculating Z arrays for bucket 2
  Entering block accumulator loop for bucket 2:
  bucket 2: 10%
  bucket 2: 20%
  bucket 2: 30%
  bucket 2: 40%
  bucket 2: 50%
  bucket 2: 60%
  bucket 2: 70%
  bucket 2: 80%
  bucket 2: 90%
  bucket 2: 100%
  Sorting block of length 73139168 for bucket 2
  (Using difference cover)
  Sorting block time: 00:00:22
Returning block of 73139169 for bucket 2
Getting block 3 of 8
  Reserving size (90715621) for bucket 3
  Calculating Z arrays for bucket 3
  Entering block accumulator loop for bucket 3:
  bucket 3: 10%
  bucket 3: 20%
  bucket 3: 30%
  bucket 3: 40%
  bucket 3: 50%
  bucket 3: 60%
  bucket 3: 70%
  bucket 3: 80%
  bucket 3: 90%
  bucket 3: 100%
  Sorting block of length 36948978 for bucket 3
  (Using difference cover)
  Sorting block time: 00:00:11
Returning block of 36948979 for bucket 3
Getting block 4 of 8
  Reserving size (90715621) for bucket 4
  Calculating Z arrays for bucket 4
  Entering block accumulator loop for bucket 4:
  bucket 4: 10%
  bucket 4: 20%
  bucket 4: 30%
  bucket 4: 40%
  bucket 4: 50%
  bucket 4: 60%
  bucket 4: 70%
  bucket 4: 80%
  bucket 4: 90%
kubu4 commented 1 year ago

Hmmm, don't know what's happening. Sorry!

This is annoying, to say the least. No error message makes this virtually impossible to troubleshoot...

Try running on Raven?

kubu4 commented 1 year ago

You ended up with the expected number of output files, so I think the indexing finished.

~With that being the case, it makes me think this comment might be the issue: # called the reference genome (scaffolds)~

~Pretty sure bash can't handle in-line comments like that. I think they always have to be at the beginning of the line.~

~Remove (or move that comment to its own line) that comment and try re-running?~

EDITED: Added strikethrough to incorrect info.

AHuffmyer commented 1 year ago

One thing you can do to help troubleshoot is add these lines to the Slurm header:

`#SBATCH --error="align_error"` # if your job fails, the error report will be put in this file

`#SBATCH --output="align_output"` # once your job is completed, any final job report comments will be put in this file

Sometimes messages relevant to a failed job will be put in either the error or output file. These will be generated in the working directory you are in. If you run that and open each file with `less align_error` and `less align_output`, do you see any useful information?

grace-ac commented 1 year ago

I've only ever used Raven for R.

To run a script on Raven, it looks like

sbatch script.sh

doesn't work. Looking into the handbook and resources to see if they show how to run a script for HISAT2 on Raven.

kubu4 commented 1 year ago
  1. You can use R to run Bash, if you'd like. Connect to RStudio Server. Then, create a bash chunk, instead of an R chunk. E.g.

```{bash}
echo "This is my first bash chunk."
```

OR...

  2. To run a script directly, you can use either of the following (sbatch is only used for Mox/HPCs):

`./script.sh`

`bash script.sh`

And, don't forget to make it executable!

To do that, run:

chmod +x script.sh
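
The make-executable-then-run workflow can be sketched end to end with a throwaway script (nothing here is project-specific):

```shell
# write a minimal script, make it executable, then run it directly
cat > hello.sh <<'EOF'
#!/bin/bash
echo "hello from script"
EOF
chmod +x hello.sh
./hello.sh        # prints: hello from script
```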

sr320 commented 1 year ago

You can run bash chunks in Rmd... similar to how you used Jupyter for kallisto... ping me on Slack and I can show you via Zoom.

kubu4 commented 1 year ago

Also, why not keep using Mox? No one is using either of the nodes ATM...

grace-ac commented 1 year ago

> Also, why not keep using Mox? No one is using either of the nodes ATM...

I'm still getting the same failed message ~6 mins into running the script, and I can't figure out why it's happening, so I thought maybe Raven would work. Not sure! Just wanted to try something new.

kubu4 commented 1 year ago

I see. Good idea. For the Mox script, did you change back over to use the 500GB node? Might be worth trying that, since it's a quick and easy change.

kubu4 commented 1 year ago

@grace-ac - I figured out why the job keeps dying!!! Mox is over our storage quota, as mentioned in #1743! Once we get that straightened out, things will work again!

grace-ac commented 1 year ago

Yes, that was it!!! It's running and I got an email that it has officially begun! Yay!

grace-ac commented 1 year ago

Never mind - still the same issue. I'll try other things and post here what happens.

kubu4 commented 1 year ago

Well, since no one's responded to that issue just yet, I'd assume no one's deleted any data yet (meaning we're still over quota).

grace-ac commented 1 year ago

Ooooh, OK. I thought maybe they had, since I got to a 7+ minute run time. I'll wait a bit then. Thanks!

kubu4 commented 1 year ago

Really, what you should do is set your working directory to use the scrubbed directory. This has terabytes upon terabytes of space to utilize. Data does get deleted after 21 days, though.

/gscratch/scrubbed

Create a directory there with your username.

Then, use that directory to do all your work. You can create your desired working directories for each job/project. In your SLURM script, set --chdir to your desired subdirectory in /gscratch/scrubbed.
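
Sketching that setup (the username and job-directory name below just follow the pattern already used in this thread; adjust to taste):

```shell
# one-time: make a personal working area in scrubbed space
mkdir -p /gscratch/scrubbed/graceac9/analyses/20231025-hisat2

# then, in the SLURM script header, run the job from that directory:
#SBATCH --chdir=/gscratch/scrubbed/graceac9/analyses/20231025-hisat2
```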

grace-ac commented 1 year ago

Ran in /gscratch/graceac9

align_error message:

Settings:
  Output files: "/gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/Phelianthoides_ref.*.ht2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  /gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna
Reading reference sizes
  Time reading reference sizes: 00:00:03
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:03
  Time to read SNPs and splice sites: 00:00:00
Using parameters --bmax 90715621 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 90715621 --dcv 1024
Constructing suffix-array element generator
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering GFM loop
Exited GFM loop
fchr[A]: 0
fchr[C]: 147268959
fchr[G]: 241923747
fchr[T]: 336597128
fchr[$]: 483816647
Exiting GFM::buildToDisk()
Returning from initFromVector
Wrote 165673412 bytes to primary GFM file: /gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/Phelianthoides_ref.1.ht2
Wrote 120954168 bytes to secondary GFM file: /gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/Phelianthoides_ref.2.ht2
Re-opening _in1 and _in2 as input streams
Returning from GFM constructor
Returning from initFromVector
Wrote 222469277 bytes to primary GFM file: /gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/Phelianthoides_ref.5.ht2
Wrote 123078258 bytes to secondary GFM file: /gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/Phelianthoides_ref.6.ht2
Re-opening _in5 and _in5 as input streams
Returning from HGFM constructor
Headers:
    len: 483816647
    gbwtLen: 483816648
    nodes: 483816648
    sz: 120954162
    gbwtSz: 120954163
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 0
    eftabSz: 0
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 30238541
    offsSz: 120954164
    lineSz: 64
    sideSz: 64
    sideGbwtSz: 48
    sideGbwtLen: 192
    numSides: 2519879
    numLines: 2519879
    gbwtTotLen: 161272256
    gbwtTotSz: 161272256
    reverse: 0
    linearFM: Yes
Total time for call to driver() for forward index: 00:06:18
--read-lengths arg must be at least 20
HISAT2 version 2.2.0 by Daehwan Kim (infphilo@gmail.com, www.ccb.jhu.edu/people/infphilo)
Usage: 
  hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]

  <ht2-idx>  Index filename prefix (minus trailing .X.ht2).
  <m1>       Files with #1 mates, paired with files in <m2>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <m2>       Files with #2 mates, paired with files in <m1>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <r>        Files with unpaired reads.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <sam>      File for SAM output (default: stdout)

  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
  specified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'.

Options (defaults in parentheses):

 Input:
  -q                 query input files are FASTQ .fq/.fastq (default)
  --qseq             query input files are in Illumina's qseq format
  -f                 query input files are (multi-)FASTA .fa/.mfa
  -r                 query input files are raw one-sequence-per-line
  -c                 <m1>, <m2>, <r> are sequences themselves, not files
  -s/--skip <int>    skip the first <int> reads/pairs in the input (none)
  -u/--upto <int>    stop after first <int> reads/pairs (no limit)
  -5/--trim5 <int>   trim <int> bases from 5'/left end of reads (0)
  -3/--trim3 <int>   trim <int> bases from 3'/right end of reads (0)
  --phred33          qualities are Phred+33 (default)
  --phred64          qualities are Phred+64
  --int-quals        qualities encoded as space-delimited integers

 Presets:                 Same as:
   --fast                 --no-repeat-index
   --sensitive            --bowtie2-dp 1 -k 30 --score-min L,0,-0.5
   --very-sensitive       --bowtie2-dp 2 -k 50 --score-min L,0,-1

 Alignment:
  --bowtie2-dp <int> use Bowtie2's dynamic programming alignment algorithm (0) - 0: no dynamic programming, 1: conditional dynamic programming, and 2: unconditional dynamic programming (slowest)
  --n-ceil <func>    func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)
  --ignore-quals     treat all quality values as 30 on Phred scale (off)
  --nofw             do not align forward (original) version of read (off)
  --norc             do not align reverse-complement version of read (off)
  --no-repeat-index  do not use repeat index

 Spliced Alignment:
  --pen-cansplice <int>              penalty for a canonical splice site (0)
  --pen-noncansplice <int>           penalty for a non-canonical splice site (12)
  --pen-canintronlen <func>          penalty for long introns (G,-8,1) with canonical splice sites
  --pen-noncanintronlen <func>       penalty for long introns (G,-8,1) with noncanonical splice sites
  --min-intronlen <int>              minimum intron length (20)
  --max-intronlen <int>              maximum intron length (500000)
  --known-splicesite-infile <path>   provide a list of known splice sites
  --novel-splicesite-outfile <path>  report a list of splice sites
  --novel-splicesite-infile <path>   provide a list of novel splice sites
  --no-temp-splicesite               disable the use of splice sites found
  --no-spliced-alignment             disable spliced alignment
  --rna-strandness <string>          specify strand-specific information (unstranded)
  --tmo                              reports only those alignments within known transcriptome
  --dta                              reports alignments tailored for transcript assemblers
  --dta-cufflinks                    reports alignments tailored specifically for cufflinks
  --avoid-pseudogene                 tries to avoid aligning reads to pseudogenes (experimental option)
  --no-templatelen-adjustment        disables template length adjustment for RNA-seq reads

 Scoring:
  --mp <int>,<int>   max and min penalties for mismatch; lower qual = lower penalty <6,2>
  --sp <int>,<int>   max and min penalties for soft-clipping; lower qual = lower penalty <2,1>
  --no-softclip      no soft-clipping
  --np <int>         penalty for non-A/C/G/Ts in read/ref (1)
  --rdg <int>,<int>  read gap open, extend penalties (5,3)
  --rfg <int>,<int>  reference gap open, extend penalties (5,3)
  --score-min <func> min acceptable alignment score w/r/t read length
                     (L,0.0,-0.2)

 Reporting:
  -k <int>           It searches for at most <int> distinct, primary alignments for each read. Primary alignments mean 
                     alignments whose alignment score is equal to or higher than any other alignments. The search terminates 
                     when it cannot find more distinct valid alignments, or when it finds <int>, whichever happens first. 
                     The alignment score for a paired-end alignment equals the sum of the alignment scores of 
                     the individual mates. Each reported read or pair alignment beyond the first has the SAM ‘secondary’ bit 
                     (which equals 256) set in its FLAGS field. For reads that have more than <int> distinct, 
                     valid alignments, hisat2 does not guarantee that the <int> alignments reported are the best possible 
                     in terms of alignment score. Default: 5 (linear index) or 10 (graph index).
                     Note: HISAT2 is not designed with large values for -k in mind, and when aligning reads to long, 
                     repetitive genomes, large -k could make alignment much slower.
  --max-seeds <int>  HISAT2, like other aligners, uses seed-and-extend approaches. HISAT2 tries to extend seeds to 
                     full-length alignments. In HISAT2, --max-seeds is used to control the maximum number of seeds that 
                     will be extended. For DNA-read alignment (--no-spliced-alignment), HISAT2 extends up to these many seeds
                     and skips the rest of the seeds. For RNA-read alignment, HISAT2 skips extending seeds and reports 
                     no alignments if the number of seeds is larger than the number specified with the option, 
                     to be compatible with previous versions of HISAT2. Large values for --max-seeds may improve alignment 
                     sensitivity, but HISAT2 is not designed with large values for --max-seeds in mind, and when aligning 
                     reads to long, repetitive genomes, large --max-seeds could make alignment much slower. 
                     The default value is the maximum of 5 and the value that comes with -k times 2.
  -a/--all           HISAT2 reports all alignments it can find. Using the option is equivalent to using both --max-seeds 
                     and -k with the maximum value that a 64-bit signed integer can represent (9,223,372,036,854,775,807).
  --repeat           report alignments to repeat sequences directly

 Paired-end:
  -I/--minins <int>  minimum fragment length (0), only valid with --no-spliced-alignment
  -X/--maxins <int>  maximum fragment length (500), only valid with --no-spliced-alignment
  --fr/--rf/--ff     -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)
  --no-mixed         suppress unpaired alignments for paired reads
  --no-discordant    suppress discordant alignments for paired reads

 Output:
  -t/--time          print wall-clock time taken by search phases
  --un <path>           write unpaired reads that didn't align to <path>
  --al <path>           write unpaired reads that aligned at least once to <path>
  --un-conc <path>      write pairs that didn't align concordantly to <path>
  --al-conc <path>      write pairs that aligned concordantly at least once to <path>
  (Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.
  --un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)
  --summary-file <path> print alignment summary to this file.
  --new-summary         print alignment summary in a new style, which is more machine-friendly.
  --quiet               print nothing to stderr except serious errors
  --met-file <path>     send metrics to file at <path> (off)
  --met-stderr          send metrics to stderr (off)
  --met <int>           report internal counters & metrics every <int> secs (1)
  --no-head             suppress header lines, i.e. lines starting with @
  --no-sq               suppress @SQ header lines
  --rg-id <text>        set read group id, reflected in @RG line and RG:Z: opt field
  --rg <text>           add <text> ("lab:value") to @RG line of SAM header.
                        Note: @RG line only printed when --rg-id is set.
  --omit-sec-seq        put '*' in SEQ and QUAL fields for secondary alignments.

 Performance:
  -o/--offrate <int> override offrate of index; must be >= index's offrate
  -p/--threads <int> number of alignment threads to launch (1)
  --reorder          force SAM output order to match order of input reads
  --mm               use memory-mapped I/O for index; many 'hisat2's can share

 Other:
  --qc-filter        filter out reads that are bad according to QSEQ filter
  --seed <int>       seed for random number generator (0)
  --non-deterministic seed rand. gen. arbitrarily instead of using read attributes
  --remove-chrname   remove 'chr' from reference names in alignment
  --add-chrname      add 'chr' to reference names in alignment 
  --version          print version information and quit
  -h/--help          print this usage message
Error: Encountered internal HISAT2 exception (#1)
Command: /gscratch/srlab/programs/hisat2-2.2.0/hisat2-align-s --wrapper basic-0 -p 8 --dta -x /gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/Phelianthoides_ref -S .sam --read-lengths 150,145,149,141,136,121,144,139,130,147,143,135,131,137,133,124,142,140,129,138,120,146,132,125,122,117,134,127,126,119,128,123,148,115,110,102,108,118,114,113,109,107,112,111,103,116,105,100,94,106,104,99,83,80,97,96,91,98,90,89,87,86,55,82,75,66,61,49,40,101,95,92,76,72,46,35,32,31,88,85,84,81,78,74,73,70,69,67,65,62,54,51,48,47,41,39,36,33,27,16 -U /tmp/586.unp 
(ERR): hisat2-align exited with value 1
kubu4 commented 1 year ago
  1. Although not the source of this error, please start running your jobs on /gscratch/scrubbed/graceac9/. Using the scrubbed directory is considered a "best practice" for Mox. Apologies for not realizing/mentioning that earlier. I'll make sure to update our Mox guidelines to reflect this.

  2. Please provide a URL to the code any time a new error pops up.

kubu4 commented 1 year ago

Anyway, looks like I've encountered this error myself!

https://github.com/DaehwanKimLab/hisat2/issues/245

In the process of installing the updated version (/gscratch/srlab/programs/hisat2-2.2.1/) as I type this.

I'll respond once installation is complete.

Then, switch to using the updated version.
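
Once the install finishes, switching should just be a matter of pointing the script at the new path (the 2.2.1 path is from the comment above; the rest of the hisat2 call stays the same as in the failing command):

```shell
# old: /gscratch/srlab/programs/hisat2-2.2.0/hisat2
# new - confirm the version before re-running the job:
/gscratch/srlab/programs/hisat2-2.2.1/hisat2 --version
```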