Yep - you got it! Hisat is the way to go.
Yes exactly! Here is an example script I used for mapping TagSeq reads to a genome in corals (note the script will be a bit different for RNAseq): https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/blob/master/Mcap2020/Scripts/TagSeq/Genome_V3/TagSeq_BioInf_genomeV3.md
sweet!! thanks so much! I'll take a look and will post issues if I get stuck.
> a published genome
I don't see the actual genome (i.e. a FastA) on that page - just various annotations...
yes that is exactly what I'm trying to find right now ha... I might not be looking in the correct place.
I ran this in command line on Mox to download the .zip genome:
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCA_032158295.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCA_032158295.1.zip" -H "Accept: application/zip"
the code was from the NCBI GenBank page: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_032158295.1/
I now have the .zip file in a directory on mox:
/gscratch/srlab/graceac9/data/pycno/genome
I need to unzip the file. I tried:
gunzip GCA_032158295.1.zip.gz
and it says:
gzip: GCA_032158295.1.zip: unknown suffix -- ignored
URL to code? (Rmd or Quarto file)
It's not a gzip file (not sure where that filename came from), it's a zip file. So, you should run:
unzip GCA_032158295.1.zip
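As a quick check before extracting, the standard file utility will report the actual archive type:

```bash
# a complete download should report "Zip archive data", not "gzip compressed data"
file GCA_032158295.1.zip
```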
ran
unzip GCA_032158295.1.zip
error:
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of GCA_032158295.1.zip or
GCA_032158295.1.zip.zip, and cannot find GCA_032158295.1.zip.ZIP, period.
URL to code?
does curl not work with Mox? @yaaminiv has assisted me greatly and we figured out I could run the curl command on my computer, unzip there, and now will rsync to Mox
URL to code?
URL to code?
my bad - thought this was Raven...
re: curl and Mox - you need to be on a login node... https://robertslab.github.io/resources/mox_Node-Types/
but rsync also works just fine.
@grace-ac - You're right - this command fails on Mox (I just tested on login and build nodes).
But, you didn't mention the error in your post: curl: (23) Failed writing body (1692 != 8186)
I've emailed UW IT, but glad you figured out the workaround.
ok things have been figured out by @yaaminiv!
ran the curl command from this comment: https://github.com/RobertsLab/resources/issues/1738#issuecomment-1778215854 which curled the .zip into my Downloads on my laptop
then rsynced the .zip file from my Downloads to my Mox user directory
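A sketch of that transfer (the Mox hostname and destination directory here are assumptions based on paths mentioned earlier in the thread):

```bash
# copy the downloaded genome archive from the laptop to Mox
rsync -avz ~/Downloads/GCA_032158295.1.zip \
  graceac9@mox.hyak.uw.edu:/gscratch/srlab/graceac9/data/pycno/genome/
```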
now I'm going to work on the hisat2 script to align the summer 2021 sea star QC'ed and trimmed RNAseq reads to the pycno genome
I have the .sh for hisat2 queued up on Mox (02-hisat2-summer2021.sh).
the queue says:
4963159 srlab gac_2021 graceac9 PD 0:00 1 (Resources)
I don't remember seeing (Resources) before. Is that typical? Is there something I need to change?
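(Resources) is SLURM's pending-reason field. One way to inspect it directly, as a sketch using the job ID from the queue listing above:

```bash
# ask SLURM why the job is pending; Reason=Resources means it's waiting for a free node
scontrol show job 4963159 | grep -i reason
```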
What does your script look like? More specifically, which node are you trying to use (i.e. 500GB or 120GB)? Someone is currently using one of the two nodes, so I'm guessing they're already using the 500GB node. You can do one of the following: leave your job queued until the other job finishes,
OR
cancel it and change your script to use the 120GB node:
scancel 4963159
here's my script: 02-hisat2-summer2021.sh
and yes, I'm using the 500GB node, so I'll cancel this job and change to the 120GB node
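For reference, a minimal sketch of what the relevant header lines might look like (only the job name is taken from this thread; the account/partition and other values are assumptions, not the actual script):

```bash
#!/bin/bash
#SBATCH --job-name=gac_2021_hisat2   # job name from the queue listing above
#SBATCH --account=srlab              # assumed account/partition names
#SBATCH --partition=srlab
#SBATCH --nodes=1
#SBATCH --mem=120G                   # the 120GB node instead of 500G
```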
link to my script as it is currently: code/02-hisat2-summer2021.sh
it ran for ~6 minutes, and then I got a slurm error message that is very long... Mox path to slurm-4963471.out:
/gscratch/srlab/graceac9/analyses/20231025-hisat2
message:
Settings:
Output files: "/gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/Phelianthoides_ref.*.ht2"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Local offset rate: 3 (one in 8)
Local fTable chars: 6
Local sequence length: 57344
Local sequence overlap between two consecutive indexes: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
/gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna
Reading reference sizes
Time reading reference sizes: 00:00:04
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:02
Time to read SNPs and splice sites: 00:00:00
Using parameters --bmax 90715621 --dcv 1024
Doing ahead-of-time memory usage test
Passed! Constructing with these parameters: --bmax 90715621 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
Building sPrime
Building sPrimeOrder
V-Sorting samples
V-Sorting samples time: 00:00:09
Allocating rank array
Ranking v-sort output
Ranking v-sort output time: 00:00:02
Invoking Larsson-Sadakane on ranks
Invoking Larsson-Sadakane on ranks time: 00:00:03
Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
(Using difference cover)
Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
Splitting and merging time: 00:00:00
Split 1, merged 7; iterating...
Splitting and merging
Splitting and merging time: 00:00:00
Split 1, merged 1; iterating...
Splitting and merging
Splitting and merging time: 00:00:00
Split 1, merged 0; iterating...
Splitting and merging
Splitting and merging time: 00:00:00
Avg bucket size: 6.04771e+07 (target: 90715620)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering GFM loop
Getting block 1 of 8
Reserving size (90715621) for bucket 1
Calculating Z arrays for bucket 1
Entering block accumulator loop for bucket 1:
bucket 1: 10%
bucket 1: 20%
bucket 1: 30%
bucket 1: 40%
bucket 1: 50%
bucket 1: 60%
bucket 1: 70%
bucket 1: 80%
bucket 1: 90%
bucket 1: 100%
Sorting block of length 36899782 for bucket 1
(Using difference cover)
Sorting block time: 00:00:11
Returning block of 36899783 for bucket 1
Getting block 2 of 8
Reserving size (90715621) for bucket 2
Calculating Z arrays for bucket 2
Entering block accumulator loop for bucket 2:
bucket 2: 10%
bucket 2: 20%
bucket 2: 30%
bucket 2: 40%
bucket 2: 50%
bucket 2: 60%
bucket 2: 70%
bucket 2: 80%
bucket 2: 90%
bucket 2: 100%
Sorting block of length 73139168 for bucket 2
(Using difference cover)
Sorting block time: 00:00:22
Returning block of 73139169 for bucket 2
Getting block 3 of 8
Reserving size (90715621) for bucket 3
Calculating Z arrays for bucket 3
Entering block accumulator loop for bucket 3:
bucket 3: 10%
bucket 3: 20%
bucket 3: 30%
bucket 3: 40%
bucket 3: 50%
bucket 3: 60%
bucket 3: 70%
bucket 3: 80%
bucket 3: 90%
bucket 3: 100%
Sorting block of length 36948978 for bucket 3
(Using difference cover)
Sorting block time: 00:00:11
Returning block of 36948979 for bucket 3
Getting block 4 of 8
Reserving size (90715621) for bucket 4
Calculating Z arrays for bucket 4
Entering block accumulator loop for bucket 4:
bucket 4: 10%
bucket 4: 20%
bucket 4: 30%
bucket 4: 40%
bucket 4: 50%
bucket 4: 60%
bucket 4: 70%
bucket 4: 80%
bucket 4: 90%
Is that the full Slurm contents?
If not, please paste the last 25 lines or so...
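A quick way to grab those (the path is assembled from the directory and filename posted above):

```bash
# print the last 25 lines of the SLURM output file
tail -n 25 /gscratch/srlab/graceac9/analyses/20231025-hisat2/slurm-4963471.out
```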
that is the full slurm contents
Wait. Did your email from Mox say the job completed or failed?
email said failed!
Thanks.
What was the error code?
Also, that script linked above is specifying 500G of memory, but you changed it, right? Can you please share the current script?
Slurm Job_id=4963471 Name=gac_2021_hisat2 Failed, Run time 00:06:07, FAILED, ExitCode 1
Thanks.
oh! thanks for catching that - here's the current script: 02-hisat2-summer2021.sh
Don't know if it actually makes a difference or not, but there shouldn't be a B at the end of the memory specification. It should just be 120G. Try making that change and re-run the script.
ok! just changed and re-running now. thanks! we'll see what happens
email says failed: Slurm Job_id=4963510 Name=gac_2021_hisat2 Failed, Run time 00:06:27, FAILED, ExitCode 1
full slurm output:
Settings:
Output files: "/gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/Phelianthoides_ref.*.ht2"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Local offset rate: 3 (one in 8)
Local fTable chars: 6
Local sequence length: 57344
Local sequence overlap between two consecutive indexes: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
/gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna
Reading reference sizes
Time reading reference sizes: 00:00:04
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:03
Time to read SNPs and splice sites: 00:00:00
Using parameters --bmax 90715621 --dcv 1024
Doing ahead-of-time memory usage test
Passed! Constructing with these parameters: --bmax 90715621 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
Building sPrime
Building sPrimeOrder
V-Sorting samples
V-Sorting samples time: 00:00:09
Allocating rank array
Ranking v-sort output
Ranking v-sort output time: 00:00:02
Invoking Larsson-Sadakane on ranks
Invoking Larsson-Sadakane on ranks time: 00:00:03
Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
(Using difference cover)
Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
Splitting and merging time: 00:00:00
Split 1, merged 7; iterating...
Splitting and merging
Splitting and merging time: 00:00:00
Split 1, merged 1; iterating...
Splitting and merging
Splitting and merging time: 00:00:00
Split 1, merged 0; iterating...
Splitting and merging
Splitting and merging time: 00:00:00
Avg bucket size: 6.04771e+07 (target: 90715620)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering GFM loop
Getting block 1 of 8
Reserving size (90715621) for bucket 1
Calculating Z arrays for bucket 1
Entering block accumulator loop for bucket 1:
bucket 1: 10%
bucket 1: 20%
bucket 1: 30%
bucket 1: 40%
bucket 1: 50%
bucket 1: 60%
bucket 1: 70%
bucket 1: 80%
bucket 1: 90%
bucket 1: 100%
Sorting block of length 36899782 for bucket 1
(Using difference cover)
Sorting block time: 00:00:11
Returning block of 36899783 for bucket 1
Getting block 2 of 8
Reserving size (90715621) for bucket 2
Calculating Z arrays for bucket 2
Entering block accumulator loop for bucket 2:
bucket 2: 10%
bucket 2: 20%
bucket 2: 30%
bucket 2: 40%
bucket 2: 50%
bucket 2: 60%
bucket 2: 70%
bucket 2: 80%
bucket 2: 90%
bucket 2: 100%
Sorting block of length 73139168 for bucket 2
(Using difference cover)
Sorting block time: 00:00:22
Returning block of 73139169 for bucket 2
Getting block 3 of 8
Reserving size (90715621) for bucket 3
Calculating Z arrays for bucket 3
Entering block accumulator loop for bucket 3:
bucket 3: 10%
bucket 3: 20%
bucket 3: 30%
bucket 3: 40%
bucket 3: 50%
bucket 3: 60%
bucket 3: 70%
bucket 3: 80%
bucket 3: 90%
bucket 3: 100%
Sorting block of length 36948978 for bucket 3
(Using difference cover)
Sorting block time: 00:00:11
Returning block of 36948979 for bucket 3
Getting block 4 of 8
Reserving size (90715621) for bucket 4
Calculating Z arrays for bucket 4
Entering block accumulator loop for bucket 4:
bucket 4: 10%
bucket 4: 20%
bucket 4: 30%
bucket 4: 40%
bucket 4: 50%
bucket 4: 60%
bucket 4: 70%
bucket 4: 80%
bucket 4: 90%
Hmmm, Don't know what's happening. Sorry!
This is annoying, to say the least. No error message makes this virtually impossible to troubleshoot...
Try running on Raven?
You ended up with the expected number of output files, so I think the indexing finished.
~~With that being the case, it makes me think this comment might be the issue: # called the reference genome (scaffolds)~~
~~Pretty sure bash can't handle in-line comments like that. I think they always have to be at the beginning of the line.~~
~~Remove that comment (or move it to its own line) and try re-running?~~
EDITED: Added strikethrough to incorrect info.
One thing you can do to help troubleshoot is add these lines into the slurm header:
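Presumably something like the following, given the align_error/align_output filenames mentioned below:

```bash
#SBATCH --error=align_error    # stderr from the job lands here
#SBATCH --output=align_output  # stdout from the job lands here
```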
Sometimes messages relevant to a failed job will be put in either the error or output file. These will be generated in the working directory you are in. If you run that and open each file with 'less align_error' and 'less align_output', do you see any useful information?
I've only ever used Raven for R. To run a script on Raven, it looks like sbatch script.sh doesn't work. Looking into the handbook and resources to see if they show how to run a script for hisat on Raven.
```{bash}
echo "This is my first bash chunk."
```
OR run the script directly on the command line (sbatch is only used for Mox/HPCs):
bash script.sh
./script.sh
/full/path/script.sh
And, don't forget to make it executable!
To do that, run:
chmod +x script.sh
You can run bash chunks in Rmd... similar to how you used Jupyter for kallisto... ping me on Slack and I can show you via Zoom
Also, why not keep using Mox? No one is using either of the nodes ATM...
I'm still getting the same failed message ~6 mins into running the script, and I can't figure out why it's happening, so I thought maybe Raven would work. Not sure! Just wanted to try something new.
I see. Good idea. For the Mox script, did you change back over to use the 500GB node? Might be worth trying that, since it's a quick and easy change.
@grace-ac - I figured out why the job keeps dying!!! Mox is over our storage quota, as mentioned in #1743! Once we get that straightened out, things will work again!
yes that was it!!! it's running and I got an email that it has officially begun! yay!
never mind. still the same issue. I'll try other things and post here what happens
Well, since no one's responded to that issue just yet, I'd assume no one's deleted any data yet (meaning we're still over quota).
ooooh ok. I thought maybe they had, since I got to a 7+ minute run time. I'll wait a bit then. thanks!
Really, what you should do is set your working directory to use the scrubbed directory: /gscratch/scrubbed. This has terabytes upon terabytes of space to utilize. Data does get deleted after 21 days, though.
Create a directory there with your username.
Then, use that directory to do all your work. You can create your desired working directories for each job/project. In your SLURM script, set the --chdir to your desired subdirectory in /gscratch/scrubbed.
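For instance, a sketch of the relevant header line (the subdirectory name here is hypothetical):

```bash
#SBATCH --chdir=/gscratch/scrubbed/graceac9/20231025-hisat2   # hypothetical job directory
```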
ran in /gscratch/graceac9
align_error message:
Settings:
Output files: "/gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/Phelianthoides_ref.*.ht2"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Local offset rate: 3 (one in 8)
Local fTable chars: 6
Local sequence length: 57344
Local sequence overlap between two consecutive indexes: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
/gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna
Reading reference sizes
Time reading reference sizes: 00:00:03
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:03
Time to read SNPs and splice sites: 00:00:00
Using parameters --bmax 90715621 --dcv 1024
Doing ahead-of-time memory usage test
Passed! Constructing with these parameters: --bmax 90715621 --dcv 1024
Constructing suffix-array element generator
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering GFM loop
Exited GFM loop
fchr[A]: 0
fchr[C]: 147268959
fchr[G]: 241923747
fchr[T]: 336597128
fchr[$]: 483816647
Exiting GFM::buildToDisk()
Returning from initFromVector
Wrote 165673412 bytes to primary GFM file: /gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/Phelianthoides_ref.1.ht2
Wrote 120954168 bytes to secondary GFM file: /gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/Phelianthoides_ref.2.ht2
Re-opening _in1 and _in2 as input streams
Returning from GFM constructor
Returning from initFromVector
Wrote 222469277 bytes to primary GFM file: /gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/Phelianthoides_ref.5.ht2
Wrote 123078258 bytes to secondary GFM file: /gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/Phelianthoides_ref.6.ht2
Re-opening _in5 and _in5 as input streams
Returning from HGFM constructor
Headers:
len: 483816647
gbwtLen: 483816648
nodes: 483816648
sz: 120954162
gbwtSz: 120954163
lineRate: 6
offRate: 4
offMask: 0xfffffff0
ftabChars: 10
eftabLen: 0
eftabSz: 0
ftabLen: 1048577
ftabSz: 4194308
offsLen: 30238541
offsSz: 120954164
lineSz: 64
sideSz: 64
sideGbwtSz: 48
sideGbwtLen: 192
numSides: 2519879
numLines: 2519879
gbwtTotLen: 161272256
gbwtTotSz: 161272256
reverse: 0
linearFM: Yes
Total time for call to driver() for forward index: 00:06:18
--read-lengths arg must be at least 20
HISAT2 version 2.2.0 by Daehwan Kim (infphilo@gmail.com, www.ccb.jhu.edu/people/infphilo)
Usage:
hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]
<ht2-idx> Index filename prefix (minus trailing .X.ht2).
<m1> Files with #1 mates, paired with files in <m2>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<m2> Files with #2 mates, paired with files in <m1>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<r> Files with unpaired reads.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<sam> File for SAM output (default: stdout)
<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.
Options (defaults in parentheses):
Input:
-q query input files are FASTQ .fq/.fastq (default)
--qseq query input files are in Illumina's qseq format
-f query input files are (multi-)FASTA .fa/.mfa
-r query input files are raw one-sequence-per-line
-c <m1>, <m2>, <r> are sequences themselves, not files
-s/--skip <int> skip the first <int> reads/pairs in the input (none)
-u/--upto <int> stop after first <int> reads/pairs (no limit)
-5/--trim5 <int> trim <int> bases from 5'/left end of reads (0)
-3/--trim3 <int> trim <int> bases from 3'/right end of reads (0)
--phred33 qualities are Phred+33 (default)
--phred64 qualities are Phred+64
--int-quals qualities encoded as space-delimited integers
Presets: Same as:
--fast --no-repeat-index
--sensitive --bowtie2-dp 1 -k 30 --score-min L,0,-0.5
--very-sensitive --bowtie2-dp 2 -k 50 --score-min L,0,-1
Alignment:
--bowtie2-dp <int> use Bowtie2's dynamic programming alignment algorithm (0) - 0: no dynamic programming, 1: conditional dynamic programming, and 2: unconditional dynamic programming (slowest)
--n-ceil <func> func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)
--ignore-quals treat all quality values as 30 on Phred scale (off)
--nofw do not align forward (original) version of read (off)
--norc do not align reverse-complement version of read (off)
--no-repeat-index do not use repeat index
Spliced Alignment:
--pen-cansplice <int> penalty for a canonical splice site (0)
--pen-noncansplice <int> penalty for a non-canonical splice site (12)
--pen-canintronlen <func> penalty for long introns (G,-8,1) with canonical splice sites
--pen-noncanintronlen <func> penalty for long introns (G,-8,1) with noncanonical splice sites
--min-intronlen <int> minimum intron length (20)
--max-intronlen <int> maximum intron length (500000)
--known-splicesite-infile <path> provide a list of known splice sites
--novel-splicesite-outfile <path> report a list of splice sites
--novel-splicesite-infile <path> provide a list of novel splice sites
--no-temp-splicesite disable the use of splice sites found
--no-spliced-alignment disable spliced alignment
--rna-strandness <string> specify strand-specific information (unstranded)
--tmo reports only those alignments within known transcriptome
--dta reports alignments tailored for transcript assemblers
--dta-cufflinks reports alignments tailored specifically for cufflinks
--avoid-pseudogene tries to avoid aligning reads to pseudogenes (experimental option)
--no-templatelen-adjustment disables template length adjustment for RNA-seq reads
Scoring:
--mp <int>,<int> max and min penalties for mismatch; lower qual = lower penalty <6,2>
--sp <int>,<int> max and min penalties for soft-clipping; lower qual = lower penalty <2,1>
--no-softclip no soft-clipping
--np <int> penalty for non-A/C/G/Ts in read/ref (1)
--rdg <int>,<int> read gap open, extend penalties (5,3)
--rfg <int>,<int> reference gap open, extend penalties (5,3)
--score-min <func> min acceptable alignment score w/r/t read length
(L,0.0,-0.2)
Reporting:
-k <int> It searches for at most <int> distinct, primary alignments for each read. Primary alignments mean
alignments whose alignment score is equal to or higher than any other alignments. The search terminates
when it cannot find more distinct valid alignments, or when it finds <int>, whichever happens first.
The alignment score for a paired-end alignment equals the sum of the alignment scores of
the individual mates. Each reported read or pair alignment beyond the first has the SAM ‘secondary’ bit
(which equals 256) set in its FLAGS field. For reads that have more than <int> distinct,
valid alignments, hisat2 does not guarantee that the <int> alignments reported are the best possible
in terms of alignment score. Default: 5 (linear index) or 10 (graph index).
Note: HISAT2 is not designed with large values for -k in mind, and when aligning reads to long,
repetitive genomes, large -k could make alignment much slower.
--max-seeds <int> HISAT2, like other aligners, uses seed-and-extend approaches. HISAT2 tries to extend seeds to
full-length alignments. In HISAT2, --max-seeds is used to control the maximum number of seeds that
will be extended. For DNA-read alignment (--no-spliced-alignment), HISAT2 extends up to these many seeds
and skips the rest of the seeds. For RNA-read alignment, HISAT2 skips extending seeds and reports
no alignments if the number of seeds is larger than the number specified with the option,
to be compatible with previous versions of HISAT2. Large values for --max-seeds may improve alignment
sensitivity, but HISAT2 is not designed with large values for --max-seeds in mind, and when aligning
reads to long, repetitive genomes, large --max-seeds could make alignment much slower.
The default value is the maximum of 5 and the value that comes with -k times 2.
-a/--all HISAT2 reports all alignments it can find. Using the option is equivalent to using both --max-seeds
and -k with the maximum value that a 64-bit signed integer can represent (9,223,372,036,854,775,807).
--repeat report alignments to repeat sequences directly
Paired-end:
-I/--minins <int> minimum fragment length (0), only valid with --no-spliced-alignment
-X/--maxins <int> maximum fragment length (500), only valid with --no-spliced-alignment
--fr/--rf/--ff -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)
--no-mixed suppress unpaired alignments for paired reads
--no-discordant suppress discordant alignments for paired reads
Output:
-t/--time print wall-clock time taken by search phases
--un <path> write unpaired reads that didn't align to <path>
--al <path> write unpaired reads that aligned at least once to <path>
--un-conc <path> write pairs that didn't align concordantly to <path>
--al-conc <path> write pairs that aligned concordantly at least once to <path>
(Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.
--un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)
--summary-file <path> print alignment summary to this file.
--new-summary print alignment summary in a new style, which is more machine-friendly.
--quiet print nothing to stderr except serious errors
--met-file <path> send metrics to file at <path> (off)
--met-stderr send metrics to stderr (off)
--met <int> report internal counters & metrics every <int> secs (1)
--no-head suppress header lines, i.e. lines starting with @
--no-sq suppress @SQ header lines
--rg-id <text> set read group id, reflected in @RG line and RG:Z: opt field
--rg <text> add <text> ("lab:value") to @RG line of SAM header.
Note: @RG line only printed when --rg-id is set.
--omit-sec-seq put '*' in SEQ and QUAL fields for secondary alignments.
Performance:
-o/--offrate <int> override offrate of index; must be >= index's offrate
-p/--threads <int> number of alignment threads to launch (1)
--reorder force SAM output order to match order of input reads
--mm use memory-mapped I/O for index; many 'hisat2's can share
Other:
--qc-filter filter out reads that are bad according to QSEQ filter
--seed <int> seed for random number generator (0)
--non-deterministic seed rand. gen. arbitrarily instead of using read attributes
--remove-chrname remove 'chr' from reference names in alignment
--add-chrname add 'chr' to reference names in alignment
--version print version information and quit
-h/--help print this usage message
Error: Encountered internal HISAT2 exception (#1)
Command: /gscratch/srlab/programs/hisat2-2.2.0/hisat2-align-s --wrapper basic-0 -p 8 --dta -x /gscratch/scrubbed/graceac9/ncbi_dataset/data/GCA_032158295.1/Phelianthoides_ref -S .sam --read-lengths 150,145,149,141,136,121,144,139,130,147,143,135,131,137,133,124,142,140,129,138,120,146,132,125,122,117,134,127,126,119,128,123,148,115,110,102,108,118,114,113,109,107,112,111,103,116,105,100,94,106,104,99,83,80,97,96,91,98,90,89,87,86,55,82,75,66,61,49,40,101,95,92,76,72,46,35,32,31,88,85,84,81,78,74,73,70,69,67,65,62,54,51,48,47,41,39,36,33,27,16 -U /tmp/586.unp
(ERR): hisat2-align exited with value 1
Although not the source of this error, please start running your jobs in /gscratch/scrubbed/graceac9/. Using the scrubbed directory is considered a "best practice" for Mox. Apologies for not realizing/mentioning that earlier. I'll make sure to update our Mox guidelines to reflect this.
Please provide URL to code any time a new error pops up.
Anyway, looks like I've encountered this error myself!
https://github.com/DaehwanKimLab/hisat2/issues/245
In the process of installing the updated version (/gscratch/srlab/programs/hisat2-2.2.1/) as I type this.
I'll respond once installation is complete.
Then, switch to using the updated version.
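If the script stores the program path in a variable, the switch is a one-line change (a sketch; the variable name is hypothetical, the path is from the comment above):

```bash
# point the script at the updated HISAT2 install
hisat2_dir="/gscratch/srlab/programs/hisat2-2.2.1"
"${hisat2_dir}/hisat2" --version   # quick sanity check; should report 2.2.1
```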
I've never done transcriptomics where there's been a reference genome! And I realized I haven't done it yet... and should for WSN... so anyway:
there's a published genome for Pycnopodia helianthoides: https://datadryad.org/stash/dataset/doi:10.5061/dryad.51c59zwfd
Based on the handbook's gene expression workflow, it looks like it's:
I've never used HISAT2 before, but it looks like there's some use cases in the handbook.
Is there anything else I'm missing? I'm pretty sure I don't use kallisto since I don't need to pseudo-align things?
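For orientation, a minimal sketch of the two HISAT2 steps the thread above works through (the genome filename and index prefix are taken from the logs in this thread; the read filenames are hypothetical placeholders):

```bash
# 1) build a HISAT2 index from the published genome FastA
/gscratch/srlab/programs/hisat2-2.2.1/hisat2-build \
  GCA_032158295.1_ASM3215829v1_genomic.fna \
  Phelianthoides_ref

# 2) align trimmed paired-end RNAseq reads to that index
# (-1/-2/-S filenames are hypothetical)
/gscratch/srlab/programs/hisat2-2.2.1/hisat2 \
  -x Phelianthoides_ref \
  --dta \
  -p 8 \
  -1 sample_R1.trimmed.fastq.gz \
  -2 sample_R2.trimmed.fastq.gz \
  -S sample.sam
```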