Closed cea295933 closed 6 months ago
ok ... trying both of these
even now, I can find the "counts.tsv" in the working folder ... is that complete at this point of the pipeline? if so, I can use that to run DESeq2 in R myself ... (at least as long as this now works on the complete dataset ... haven't tried that yet)
those don't appear to work ... attaching the files in the work directory in the event they spark any ideas files.zip
Hi, hmm looks like you have all novel transcripts - does BK006934.2 match up with the reference file fasta header?
oh sorry i see you transcript_ids are all 'unassigned_transcript_1234' is that expected?
yeah, that's strange. on one of my previous runs I didn't see that ... I don't remember now whether it was refseq or genbank (or it could be a gff vs. gtf) .... I'll try to figure what the combination was, but that all_counts.tsv is below all_counts.txt
ok ... if I run using the GenBank .fa and .gff, I get transcripts assigned. But if I replace with the .gtf (or use any combination of the RefSeq files), I get either unassigned transcripts or MSTRG (which I think come from stringtie). But I appear to get the same/similar error for all.
NCBI Gengank .fa and .gff 1 Reference MUTpoly2 MUTpoly1 WTpoly1 WTpoly2 2 rna-YNCL0010C 1369.0 1222.0 1721.0 2023.0 3 rna-YNCL0012C 1148.0 1044.0 1435.0 1717.0 4 rna-YLR154C-G 274.0 212.0 214.0 369.0 5 rna-YNCH0007W 230.0 212.0 372.0 455.0 6 rna-YNCL0016C 208.0 165.0 276.0 299.0 7 MSTRG.1131.1 188.0 151.0 107.0 112.0 8 rna-YBR118W 175.0 147.0 156.0 147.0 9 rna-YGL103W 150.0 163.0 92.0 114.0 10 rna-YLR110C 142.0 152.0 162.0 196.0
Error
2384 ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'
2385
2386 Caused by:
2387 Process pipeline:differential_expression:deAnalysis (1)
terminated with an error exit status (1)
2388
2389 Command executed:
2390
2391 mkdir merged
2392 mkdir de_analysis
2393 mv de_transcript_counts.tsv merged/all_counts.tsv
2394 mv BarcodesPoly.csv de_analysis/coldata.tsv
2395 de_analysis.R annotation.gtf 3 1 10 3 gtf true
2396
2397 Command exit status:
2398 1
2399
2400 Command output:
2401 Loading counts, conditions and parameters.
2402 Loading annotation database.
2403
2404 Command error:
2405 Loading counts, conditions and parameters.
2406 Warning message:
2407 In read.table(file = file, header = header, sep = sep, quote = quote, :
2408 incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv'
2409 Loading annotation database.
2410 Import genomic features from the file as a GRanges object ... OK
2411 Prepare the 'metadata' data frame ... OK
2412 Make the TxDb object ... Error in .makeTxDb_normarg_transcripts(transcripts) :
2413 values in 'transcripts$tx_strand' must be "+" or "-"
2414 Calls: makeTxDbFromGFF ... makeTxDbFromGRanges -> makeTxDb -> .makeTxDb_normarg_transcripts
2415 Execution halted
2416
2417 Work dir:
2418 /work/epi2me/work/d3/fd1f38bd3fe7e3c1162b91b2ac41c5
NCBI Genbank .fa and .gtf 1 Reference MUTpoly2 MUTpoly1 WTpoly1 WTpoly2 2 unassigned_transcript_4097 1363.0 1221.0 1722.0 2025.0 3 unassigned_transcript_4099 1150.0 1042.0 1436.0 1719.0 4 unassigned_transcript_4110 278.0 215.0 209.0 365.0 5 unassigned_transcript_2712 230.0 212.0 372.0 455.0 6 unassigned_transcript_4104 208.0 166.0 277.0 298.0 7 MSTRG.1131.1 188.0 151.0 107.0 112.0 8 unassigned_transcript_350 175.0 147.0 156.0 147.0 9 unassigned_transcript_2160 150.0 163.0 92.0 114.0 10 unassigned_transcript_4051 142.0 152.0 162.0 196.0
Error
2382 ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'
2383
2384 Caused by:
2385 Process pipeline:differential_expression:deAnalysis (1)
terminated with an error exit status (1)
2386
2387 Command executed:
2388
2389 mkdir merged
2390 mkdir de_analysis
2391 mv de_transcript_counts.tsv merged/all_counts.tsv
2392 mv BarcodesPoly.csv de_analysis/coldata.tsv
2393 de_analysis.R annotation.gtf 3 1 10 3 gtf true
2394
2395 Command exit status:
2396 1
2397
2398 Command output:
2399 Loading counts, conditions and parameters.
2400 Loading annotation database.
2401
2402 Command error:
2403 Loading counts, conditions and parameters.
2404 Warning message:
2405 In read.table(file = file, header = header, sep = sep, quote = quote, :
2406 incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv'
2407 Loading annotation database.
2408 Import genomic features from the file as a GRanges object ... OK
2409 Prepare the 'metadata' data frame ... OK
2410 Make the TxDb object ... Error in .makeTxDb_normarg_transcripts(transcripts) :
2411 values in 'transcripts$tx_strand' must be "+" or "-"
2412 Calls: makeTxDbFromGFF ... makeTxDbFromGRanges -> makeTxDb -> .makeTxDb_normarg_transcripts
2413 Execution halted
2414
2415 Work dir:
2416 /work/epi2me/work/9c/84eec819b9dfa17fda6c0626238884
NCBI Refseq .fa and .gff 1 Reference MUTpoly2 MUTpoly1 WTpoly1 WTpoly2 2 MSTRG.263.1 188.0 151.0 107.0 112.0 3 MSTRG.289.1 159.0 175.0 96.0 121.0 4 MSTRG.457.1 152.0 130.0 94.0 119.0 5 MSTRG.511.1 148.0 157.0 167.0 198.0 6 MSTRG.32.1 140.0 134.0 114.0 106.0 7 MSTRG.768.1 127.0 133.0 81.0 81.0 8 MSTRG.61.1 113.0 96.0 111.0 100.0 9 MSTRG.198.1 112.0 82.0 39.0 38.0 10 MSTRG.515.1 109.0 107.0 74.0 72.0
Error
2283 ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'
2284
2285 Caused by:
2286 Process pipeline:differential_expression:deAnalysis (1)
terminated with an error exit status (1)
2287
2288 Command executed:
2289
2290 mkdir merged
2291 mkdir de_analysis
2292 mv de_transcript_counts.tsv merged/all_counts.tsv
2293 mv BarcodesPoly.csv de_analysis/coldata.tsv
2294 de_analysis.R annotation.gtf 3 1 10 3 gtf true
2295
2296 Command exit status:
2297 1
2298
2299 Command output:
2300 Loading counts, conditions and parameters.
2301 Loading annotation database.
2302 Filtering counts using DRIMSeq.
2303
2304 Command error:
2305 Loading counts, conditions and parameters.
2306 Warning message:
2307 In read.table(file = file, header = header, sep = sep, quote = quote, :
2308 incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv'
2309 Loading annotation database.
2310 Import genomic features from the file as a GRanges object ... OK
2311 Prepare the 'metadata' data frame ... OK
2312 Make the TxDb object ... OK
2313 'select()' returned 1:many mapping between keys and columns
2314 Filtering counts using DRIMSeq.
2315 Error in dmDSdata(counts = counts, samples = coldata) :
2316 mode(counts) %in% "numeric" is not TRUE
2317 Calls: dmDSdata -> stopifnot
2318 Execution halted
2319
2320 Work dir:
2321 /work/epi2me/work/7f/bbb4ae1a2db55fea2627f197d32994
NCBI Refseq .fa and .gtf 1 Reference MUTpoly2 WTpoly1 MUTpoly1 WTpoly2 2 MSTRG.263.1 188.0 107.0 151.0 112.0 3 MSTRG.289.1 159.0 96.0 175.0 121.0 4 MSTRG.457.1 152.0 94.0 130.0 119.0 5 MSTRG.511.1 148.0 167.0 157.0 198.0 6 MSTRG.32.1 140.0 115.0 135.0 106.0 7 MSTRG.768.1 127.0 81.0 133.0 81.0 8 MSTRG.61.1 113.0 111.0 96.0 100.0 9 MSTRG.198.1 112.0 39.0 82.0 38.0 10 MSTRG.515.1 109.0 74.0 107.0 72.0
Error
2206 ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'
2207
2208 Caused by:
2209 Process pipeline:differential_expression:deAnalysis (1)
terminated with an error exit status (1)
2210
2211 Command executed:
2212
2213 mkdir merged
2214 mkdir de_analysis
2215 mv de_transcript_counts.tsv merged/all_counts.tsv
2216 mv BarcodesPoly.csv de_analysis/coldata.tsv
2217 de_analysis.R annotation.gtf 3 1 10 3 gtf true
2218
2219 Command exit status:
2220 1
2221
2222 Command output:
2223 Loading counts, conditions and parameters.
2224 Loading annotation database.
2225 Filtering counts using DRIMSeq.
2226
2227 Command error:
2228 Loading counts, conditions and parameters.
2229 Warning message:
2230 In read.table(file = file, header = header, sep = sep, quote = quote, :
2231 incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv'
2232 Loading annotation database.
2233 Import genomic features from the file as a GRanges object ... OK
2234 Prepare the 'metadata' data frame ... OK
2235 Make the TxDb object ... OK
2236 'select()' returned 1:many mapping between keys and columns
2237 Filtering counts using DRIMSeq.
2238 Error in dmDSdata(counts = counts, samples = coldata) :
2239 mode(counts) %in% "numeric" is not TRUE
2240 Calls: dmDSdata -> stopifnot
2241 Execution halted
2242
2243 Work dir:
2244 /work/epi2me/work/01/c91d7314e0a6a9978ccec290d477f9
Sorry, I feel like its nearly there! Does it retry that DE_analysis step? I would expect that error but it to retry that step 4 times and hopefully one of the retries will work, as we iterate through trying a few different gtf2/gff3 R options. Check you are on the latest version within the app (if you are in the app). Or try nextflow pull epi2me-labs/wf-transcriptomes
if you are on cmd line to check you have the latest version.
I think so too … thanks to you!
It looks like it retries twice but fails at that moment:
2379 [27/4f170f] process > output (13) [ 69%] 16 of 23
2380 Retry deAnalysis with gff format setting and version removal.
2381 Retry deAnalysis with gtf format setting and version removal.
2382 [58/b882f6] NOTE: Process pipeline:differential_expression:deAnalysis (1)
terminated with an error exit status (1) -- Execution is retried (2)
2383 [a4/ff1720] NOTE: Process pipeline:differential_expression:deAnalysis (1)
terminated with an error exit status (1) -- Execution is retried (3)
2384 ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'
I tried the update but received the below error. Perhaps I need to ask my sysadmin to run it? Is it ok to run this on the head node or do we need to run it on the compute nodes?
@.***:/work/epi2me[07:47]$ ./nextflow pull epi2me-labs/wf-transcriptomes Checking epi2me-labs/wf-transcriptomes ... epi2me-labs/wf-transcriptomes contains uncommitted changes -- cannot pull from repository
Colin Echeverría Aitken
Assistant Professor Biology Department Biochemistry Program Vassar College @. @.> 845.437.7430
On Nov 21, 2023, at 4:17 AM, Sarah Griffiths @.***> wrote:
nextflow pull epi2me-labs/wf-transcriptomes
No i'm sorry its still not working!
Ok on the files in your files.zip I did this to find the two which have strand set to . ,
grep -P '\t\.\t\.\t' annotation.gtf
BK006938.2 StringTie transcript 1302119 1302626 1000 . . gene_id "MSTRG.844"; transcript_id "MSTRG.844.1";
BK006938.2 StringTie exon 1302119 1302626 1000 . . gene_id "MSTRG.844"; transcript_id "MSTRG.844.1"; exon_number "1";
Which is causing this error values in 'transcripts$tx_strand' must be "+" or "-"
, If I run the downstream DE_analysis without these lines it seems to complete.
This is an error being created by stringtie I've not seen it before but will add a step to remove any non stranded annotations hopefully will release that by end of the week, in the meantime you could add these lines to line 92 of subworkflows/differential_expression.nf
which might fix it
grep -v -P '\t\.\t\.\t' annotation.gtf > tmp.gtf
mv tmp.gtf annotation.gtf
Above and beyond! … where can I find the subworkflows directory? I tried find, but it’s taking a while ...
Colin Echeverría Aitken
Assistant Professor Biology Department Biochemistry Program Vassar College @. @.> 845.437.7430
On Nov 21, 2023, at 12:45 PM, Sarah Griffiths @.***> wrote:
No i'm sorry its still not working!
Ok on the files in your files.zip I did this to find the two which have strand set to . ,
grep -P '\t.\t.\t' annotation.gtf BK006938.2 StringTie transcript 1302119 1302626 1000 . . gene_id "MSTRG.844"; transcript_id "MSTRG.844.1"; BK006938.2 StringTie exon 1302119 1302626 1000 . . gene_id "MSTRG.844"; transcript_id "MSTRG.844.1"; exon_number "1"; Which is causing this error values in 'transcripts$tx_strand' must be "+" or "-", If I run the downstream DE_analysis without these lines it seems to complete.
This is an error being created by stringtie I've not seen it before but will add a step to remove any non stranded annotations hopefully will release that by end of the week, in the meantime you could add these lines to line 92 of subworkflows/differential_expression.nf which might fix it
grep -v -P '\t.\t.\t' annotation.gtf > tmp.gtf mv tmp.gtf annotation.gtf — Reply to this email directly, view it on GitHub https://github.com/epi2me-labs/wf-transcriptomes/issues/45#issuecomment-1821380550, or unsubscribe https://github.com/notifications/unsubscribe-auth/BD2SWUYDNC4OOW4SMMFYKVDYFTSEBAVCNFSM6AAAAAA7M2VUXKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMRRGM4DANJVGA. You are receiving this because you were mentioned.
ok .. I think I found it. Before line 92, right? I see the de_analysis.R script being called on line 92, and it appears to use the annotation.gtf as an input if I understand if correctly
ok, I changed differential_expression.nf as below:
""" 88 mkdir merged 89 mkdir de_analysis 90 mv $merged_tsv merged/all_counts.tsv 91 mv $sample_sheet de_analysis/coldata.tsv 92 93 grep -v -P '\t.\t.\t' annotation.gtf > tmp.gtf 94 mv tmp.gtf annotation.gtf 95 96 de_analysis.R annotation.gtf $params.min_samps_gene_expr $params.min_samps_feature_expr $params.min_gene_expr $params.min_feature_expr $annotation_type $strip_version
but now I get the below the moment I launch the workflow, so I think I made the edit incorrectly (sorry!)
ERROR ~ Module compilation error
ok sorry forgot to escape the '\'
so should of been -
grep -v -P '\\t\\.\\t\\.\\t' annotation.gtf > tmp.gtf
mv tmp.gtf annotation.gtf
completed (with the downsampled data set)!!! Huzzah! I just launched jobs for the full data set to see if those work.
Thank you!!!
ah okay yes, now the challenge is the full data set - let me know how it goes
completed! Thank you so much. One question: is there a meaningful difference in the pipeline with and without the minimap2 option we incorporated ("w 50")? Trying to determine whether it's worth testing it without that flag or not, now that everything works.
Good point! It may result in a few more unaligned reads, so you may be missing some transcripts. Now its working it might be worth trying with the standard "w 10" to see if it still works, or "w 25".
Closing through lack of response and issues solves
My apologies for not closing the loop here. Yes, this is now resolved. Running smoothly on the full dataset and with either set of minimap2 options. Thank you for all of your help!
Colin Echeverría Aitken
Assistant Professor Biology Department Biochemistry Program Vassar College @. @.> 845.437.7430
On Dec 12, 2023, at 5:57 AM, Sarah Griffiths @.***> wrote:
Closing through lack of response and issues solves
— Reply to this email directly, view it on GitHub https://github.com/epi2me-labs/wf-transcriptomes/issues/45#issuecomment-1851808378, or unsubscribe https://github.com/notifications/unsubscribe-auth/BD2SWU4W22IZMKFLKI7GWKLYJA2DNAVCNFSM6AAAAAA7M2VUXKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNJRHAYDQMZXHA. You are receiving this because you were mentioned.
Ask away!
Hello,
My workflow seems to be failing as a result of what I think is a samtools error.
Command: minimap2 -t 4 -ax splice -uf -p 1.0 "genome_index.mmi" "seqs.fastq.gz" | samtools view -Sb > "output.bam" 3499 samtools sort -@ 4 "output.bam" -o "WTpoly2_reads_aln_sorted.bam"
Error: [WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index. [main_samview] fail to read the header from "-".
I've taken a quick look at the files being maniupulated here (in the offending work directory). The .mmi does not appear to be human readable but is at least not empty (I can provide if you like ... GitHub won't allow me to upload). And seqs.fastq.gz is certainly not empty: 5G compressed and 20G uncompressed (I am pasting just the first entry for your reference below). However, output.bam is empty, and so I think this causes the issue for the samtools sort command. Any idea what could be causing this?
@4aabeb37-0ae5-4289-ab27-e6f246a7e21a runid=215283ee2a07610bc86fc4baad99de671b475048 read=74966 ch=2198 start_time=2023-09-23T00:45:55.980051-04:00 flow_cell_id=PAO05278 protocol_group_id=Vassar_smm092223 sample_id=NB_A11-H11 barcode=barcode84 barcode_alias=barcode84 parent_read_id=4aabeb37-0ae5-4289-ab27-e6f246a7e21a basecall_model_version_id=dna_r10.4.1_e8.2_5khz_400bps_hac@v4.2.0 ATGTTATGTCCTGTACTTCGTTCAGTTACGTATTGCTAAGGTTAAGTAGTGGACCTAGAACCTGTGCCACAGCACCTCGCCTACCGTGACAAGAAAGTTGTCGGTGTCTTTGTGACTGCCTGTCGCTCTATCTTCAGAGGAGAGTCCGCCGCCCGCAAGTTTTTTTTTTTTTTTTTTTTTTTTTTGCCGATGTGATTCTAAAGATGTCATTCATATATTCAAGTTAAAAAATATATAGAATTGATAAGCCATATATAGAAACAATTACTAAAGAGAGAAAAAAAACCAAGTCTGGTAAAAGAAAAAGAAAAAGAGCCTAATATACGACACTTTGGTTCAAATAATTCACCAGAAGTCCTACGTTTTATCCTCCACTTATTATTATATCATCTTGTTAATTTTATTTCTTTAAAATTTAGCAAATTGCTTGTTGGCCTTACCGGCAGCAGCAGAGACCTTGACGACGTTGAATCTAACAGTCTTGGAGATTGGTCTACATTGACCAACGGTAACAATGTCACCAACTTGGACACGGAAAGCTGGGGAGACGTGGACTGGGACGTTCTTGTGTCTCTTTTCGTATCTGTTGTACTTTGGAATGTAATGCAAGTAAGCTCTTTCTGATGACAATGGTACGGTGCATCTTGGTGGAGACGACGGTACCGGTCAAGATCTTACCACGGATGGAAACTAAACCAGTGAATGGACATTTCTTATCAATGTAAGAACCTTCAATAGCGGTCTTTGGGGTCTTGAATCCCAAACCGGCATTCTTATACCATCTCTTGGTTCTCTTGGAAGTCTTGACCTTTGGATTGTTGAAGATGTGAGGTTGCTTTTGGAAAGCTCTTTAGATTGAACAGTTAATTCAGTGGACATCTTTTTTCCCTGGCTTGATACCCCAAAGTCCAAGGGCAATTCGAACTGGAAAGCAATATCAGCACCAACAGAAACACAAAGACACCGACAACTTTCTTGTCACC + %%&'(++)%%(+,++++235891122><=>>FIDDCCFD=714211<BA?BBHGITFFEDDEEEE{HIPCBA@@?@ADJPHH{I{NHE=<<=BCECEIDGHGNHFJHI{{JHPFDICE{F@@AAFBCCBI=888KHFNL{HJEGDDDEC=EHEHJ@D67>?ADRI{{{{{{{{{HBFCIJB?=.,,9{>:9433267998:99:?>E{J{P{;:99@@B<<?{GFKEL{{GICCBFGGHDIHGFEFE{{{{EHN{IHFMHDDB@ACB;{{{{2..AA?@@@@BB??ANNHJJ{G{@?FC;>?DHME=>AAG>???C@?777;>@BW>@?AFQEJSIJHIHFGH{F{QIGEH{{FCCBB^PFHQJJFDIJKLHEGGHKF{{KGDJS{K{GFF{H{OO{I{KKN{HJIKHRMJ{DEJFFDFCBEIGHJG{GFNEEKEHEKLHDDB@=DCCAED?EE;::9?93397>=;:=DCABCDBCEEIIEEABBGKGUEHHJHKJHGABC@BC5546===F6666B@BFJFFDDCEGJEC5;;IFB==?A=76689FEGD@A@AIKGGMFEG@{9=<<BDEKHHIEEGDAA@CFCFEFGED:::9;0-+(&(+(())&0..0022114222+,+.-0,+))1>AC=BKFEGHGCDE{=444--,,02+'('''+-/CDEGIHCA?@EEE{HEFHE{GJ{IF{HPDFF@CA889=B?@COC;988,012677452003404476,***00+((/)6;=::>@FFDEFBFEDHEIG{{{H6666@C.,,,<;FEGDC=77BBFGI8889FEDEDFCBCDEG<;@DLPH@??GG:9:YI{JFFBE7(((788:BBFEFFFDKJE{LHI{GBDDF{@@BDAD?>/../1011.-,-33=>@EELGJEG{HIHIFFBB@ALLGGKFHFFEDHGFHGGIJEDCHCJDEIGFJGHUIFXGHHFEIIOGMFFIGGCC:94&