samtools error ([main_samview] fail to read the header from "-") during DE analysis ... empty bam file?

cea295933 commented 7 months ago

Ask away!

Hello,

My workflow seems to be failing as a result of what I think is a samtools error.

Command: minimap2 -t 4 -ax splice -uf -p 1.0 "genome_index.mmi" "seqs.fastq.gz" | samtools view -Sb > "output.bam" 3499 samtools sort -@ 4 "output.bam" -o "WTpoly2_reads_aln_sorted.bam"

Error: [WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index. [main_samview] fail to read the header from "-".

I've taken a quick look at the files being maniupulated here (in the offending work directory). The .mmi does not appear to be human readable but is at least not empty (I can provide if you like ... GitHub won't allow me to upload). And seqs.fastq.gz is certainly not empty: 5G compressed and 20G uncompressed (I am pasting just the first entry for your reference below). However, output.bam is empty, and so I think this causes the issue for the samtools sort command. Any idea what could be causing this?

@4aabeb37-0ae5-4289-ab27-e6f246a7e21a runid=215283ee2a07610bc86fc4baad99de671b475048 read=74966 ch=2198 start_time=2023-09-23T00:45:55.980051-04:00 flow_cell_id=PAO05278 protocol_group_id=Vassar_smm092223 sample_id=NB_A11-H11 barcode=barcode84 barcode_alias=barcode84 parent_read_id=4aabeb37-0ae5-4289-ab27-e6f246a7e21a basecall_model_version_id=dna_r10.4.1_e8.2_5khz_400bps_hac@v4.2.0 ATGTTATGTCCTGTACTTCGTTCAGTTACGTATTGCTAAGGTTAAGTAGTGGACCTAGAACCTGTGCCACAGCACCTCGCCTACCGTGACAAGAAAGTTGTCGGTGTCTTTGTGACTGCCTGTCGCTCTATCTTCAGAGGAGAGTCCGCCGCCCGCAAGTTTTTTTTTTTTTTTTTTTTTTTTTTGCCGATGTGATTCTAAAGATGTCATTCATATATTCAAGTTAAAAAATATATAGAATTGATAAGCCATATATAGAAACAATTACTAAAGAGAGAAAAAAAACCAAGTCTGGTAAAAGAAAAAGAAAAAGAGCCTAATATACGACACTTTGGTTCAAATAATTCACCAGAAGTCCTACGTTTTATCCTCCACTTATTATTATATCATCTTGTTAATTTTATTTCTTTAAAATTTAGCAAATTGCTTGTTGGCCTTACCGGCAGCAGCAGAGACCTTGACGACGTTGAATCTAACAGTCTTGGAGATTGGTCTACATTGACCAACGGTAACAATGTCACCAACTTGGACACGGAAAGCTGGGGAGACGTGGACTGGGACGTTCTTGTGTCTCTTTTCGTATCTGTTGTACTTTGGAATGTAATGCAAGTAAGCTCTTTCTGATGACAATGGTACGGTGCATCTTGGTGGAGACGACGGTACCGGTCAAGATCTTACCACGGATGGAAACTAAACCAGTGAATGGACATTTCTTATCAATGTAAGAACCTTCAATAGCGGTCTTTGGGGTCTTGAATCCCAAACCGGCATTCTTATACCATCTCTTGGTTCTCTTGGAAGTCTTGACCTTTGGATTGTTGAAGATGTGAGGTTGCTTTTGGAAAGCTCTTTAGATTGAACAGTTAATTCAGTGGACATCTTTTTTCCCTGGCTTGATACCCCAAAGTCCAAGGGCAATTCGAACTGGAAAGCAATATCAGCACCAACAGAAACACAAAGACACCGACAACTTTCTTGTCACC + %%&'(++)%%(+,++++235891122><=>>FIDDCCFD=714211<BA?BBHGITFFEDDEEEE{HIPCBA@@?@ADJPHH{I{NHE=<<=BCECEIDGHGNHFJHI{{JHPFDICE{F@@AAFBCCBI=888KHFNL{HJEGDDDEC=EHEHJ@D67>?ADRI{{{{{{{{{HBFCIJB?=.,,9{>:9433267998:99:?>E{J{P{;:99@@B<<?{GFKEL{{GICCBFGGHDIHGFEFE{{{{EHN{IHFMHDDB@ACB;{{{{2..AA?@@@@BB??ANNHJJ{G{@?FC;>?DHME=>AAG>???C@?777;>@BW>@?AFQEJSIJHIHFGH{F{QIGEH{{FCCBB^PFHQJJFDIJKLHEGGHKF{{KGDJS{K{GFF{H{OO{I{KKN{HJIKHRMJ{DEJFFDFCBEIGHJG{GFNEEKEHEKLHDDB@=DCCAED?EE;::9?93397>=;:=DCABCDBCEEIIEEABBGKGUEHHJHKJHGABC@BC5546===F6666B@BFJFFDDCEGJEC5;;IFB==?A=76689FEGD@A@AIKGGMFEG@{9=<<BDEKHHIEEGDAA@CFCFEFGED:::9;0-+(&(+(())&0..0022114222+,+.-0,+))1>AC=BKFEGHGCDE{=444--,,02+'('''+-/CDEGIHCA?@EEE{HEFHE{GJ{IF{HPDFF@CA889=B?@COC;988,012677452003404476,***00+((/)6;=::>@FFDEFBFEDHEIG{{{H6666@C.,,,<;FEGDC=77BBFGI8889FEDEDFCBCDEG<;@DLPH@??GG:9:YI{JFFBE7(((788:BBFEFFFDKJE{LHI{GBDDF{@@BDAD?>/../1011.-,-33=>@EELGJEG{HIHIFFBB@ALLGGKFHFFEDHGFHGGIJEDCHCJDEIGFJGHUIFXGHHFEIIOGMFFIGGCC:94&

cea295933 commented 7 months ago

ok ... trying both of these

even now, I can find the "counts.tsv" in the working folder ... is that complete at this point of the pipeline? if so, I can use that to run DESeq2 in R myself ... (at least as long as this now works on the complete dataset ... haven't tried that yet)

cea295933 commented 7 months ago

those don't appear to work ... attaching the files in the work directory in the event they spark any ideas files.zip

sarahjeeeze commented 7 months ago

Hi, hmm looks like you have all novel transcripts - does BK006934.2 match up with the reference file fasta header?

sarahjeeeze commented 7 months ago

oh sorry i see you transcript_ids are all 'unassigned_transcript_1234' is that expected?

cea295933 commented 7 months ago

yeah, that's strange. on one of my previous runs I didn't see that ... I don't remember now whether it was refseq or genbank (or it could be a gff vs. gtf) .... I'll try to figure what the combination was, but that all_counts.tsv is below all_counts.txt

cea295933 commented 7 months ago

ok ... if I run using the GenBank .fa and .gff, I get transcripts assigned. But if I replace with the .gtf (or use any combination of the RefSeq files), I get either unassigned transcripts or MSTRG (which I think come from stringtie). But I appear to get the same/similar error for all.

NCBI Gengank .fa and .gff 1 Reference MUTpoly2 MUTpoly1 WTpoly1 WTpoly2 2 rna-YNCL0010C 1369.0 1222.0 1721.0 2023.0 3 rna-YNCL0012C 1148.0 1044.0 1435.0 1717.0 4 rna-YLR154C-G 274.0 212.0 214.0 369.0 5 rna-YNCH0007W 230.0 212.0 372.0 455.0 6 rna-YNCL0016C 208.0 165.0 276.0 299.0 7 MSTRG.1131.1 188.0 151.0 107.0 112.0 8 rna-YBR118W 175.0 147.0 156.0 147.0 9 rna-YGL103W 150.0 163.0 92.0 114.0 10 rna-YLR110C 142.0 152.0 162.0 196.0

Error 2384 ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)' 2385 2386 Caused by: 2387 Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1) 2388 2389 Command executed: 2390 2391 mkdir merged 2392 mkdir de_analysis 2393 mv de_transcript_counts.tsv merged/all_counts.tsv 2394 mv BarcodesPoly.csv de_analysis/coldata.tsv 2395 de_analysis.R annotation.gtf 3 1 10 3 gtf true 2396 2397 Command exit status: 2398 1 2399 2400 Command output: 2401 Loading counts, conditions and parameters. 2402 Loading annotation database. 2403 2404 Command error: 2405 Loading counts, conditions and parameters. 2406 Warning message: 2407 In read.table(file = file, header = header, sep = sep, quote = quote, : 2408 incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv' 2409 Loading annotation database. 2410 Import genomic features from the file as a GRanges object ... OK 2411 Prepare the 'metadata' data frame ... OK 2412 Make the TxDb object ... Error in .makeTxDb_normarg_transcripts(transcripts) : 2413 values in 'transcripts$tx_strand' must be "+" or "-" 2414 Calls: makeTxDbFromGFF ... makeTxDbFromGRanges -> makeTxDb -> .makeTxDb_normarg_transcripts 2415 Execution halted 2416 2417 Work dir: 2418 /work/epi2me/work/d3/fd1f38bd3fe7e3c1162b91b2ac41c5

NCBI Genbank .fa and .gtf 1 Reference MUTpoly2 MUTpoly1 WTpoly1 WTpoly2 2 unassigned_transcript_4097 1363.0 1221.0 1722.0 2025.0 3 unassigned_transcript_4099 1150.0 1042.0 1436.0 1719.0 4 unassigned_transcript_4110 278.0 215.0 209.0 365.0 5 unassigned_transcript_2712 230.0 212.0 372.0 455.0 6 unassigned_transcript_4104 208.0 166.0 277.0 298.0 7 MSTRG.1131.1 188.0 151.0 107.0 112.0 8 unassigned_transcript_350 175.0 147.0 156.0 147.0 9 unassigned_transcript_2160 150.0 163.0 92.0 114.0 10 unassigned_transcript_4051 142.0 152.0 162.0 196.0

Error 2382 ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)' 2383 2384 Caused by: 2385 Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1) 2386 2387 Command executed: 2388 2389 mkdir merged 2390 mkdir de_analysis 2391 mv de_transcript_counts.tsv merged/all_counts.tsv 2392 mv BarcodesPoly.csv de_analysis/coldata.tsv 2393 de_analysis.R annotation.gtf 3 1 10 3 gtf true 2394 2395 Command exit status: 2396 1 2397 2398 Command output: 2399 Loading counts, conditions and parameters. 2400 Loading annotation database. 2401 2402 Command error: 2403 Loading counts, conditions and parameters. 2404 Warning message: 2405 In read.table(file = file, header = header, sep = sep, quote = quote, : 2406 incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv' 2407 Loading annotation database. 2408 Import genomic features from the file as a GRanges object ... OK 2409 Prepare the 'metadata' data frame ... OK 2410 Make the TxDb object ... Error in .makeTxDb_normarg_transcripts(transcripts) : 2411 values in 'transcripts$tx_strand' must be "+" or "-" 2412 Calls: makeTxDbFromGFF ... makeTxDbFromGRanges -> makeTxDb -> .makeTxDb_normarg_transcripts 2413 Execution halted 2414 2415 Work dir: 2416 /work/epi2me/work/9c/84eec819b9dfa17fda6c0626238884

NCBI Refseq .fa and .gff 1 Reference MUTpoly2 MUTpoly1 WTpoly1 WTpoly2 2 MSTRG.263.1 188.0 151.0 107.0 112.0 3 MSTRG.289.1 159.0 175.0 96.0 121.0 4 MSTRG.457.1 152.0 130.0 94.0 119.0 5 MSTRG.511.1 148.0 157.0 167.0 198.0 6 MSTRG.32.1 140.0 134.0 114.0 106.0 7 MSTRG.768.1 127.0 133.0 81.0 81.0 8 MSTRG.61.1 113.0 96.0 111.0 100.0 9 MSTRG.198.1 112.0 82.0 39.0 38.0 10 MSTRG.515.1 109.0 107.0 74.0 72.0

Error 2283 ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)' 2284 2285 Caused by: 2286 Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1) 2287 2288 Command executed: 2289 2290 mkdir merged 2291 mkdir de_analysis 2292 mv de_transcript_counts.tsv merged/all_counts.tsv 2293 mv BarcodesPoly.csv de_analysis/coldata.tsv 2294 de_analysis.R annotation.gtf 3 1 10 3 gtf true 2295 2296 Command exit status: 2297 1 2298 2299 Command output: 2300 Loading counts, conditions and parameters. 2301 Loading annotation database. 2302 Filtering counts using DRIMSeq. 2303 2304 Command error: 2305 Loading counts, conditions and parameters. 2306 Warning message: 2307 In read.table(file = file, header = header, sep = sep, quote = quote, : 2308 incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv' 2309 Loading annotation database. 2310 Import genomic features from the file as a GRanges object ... OK 2311 Prepare the 'metadata' data frame ... OK 2312 Make the TxDb object ... OK 2313 'select()' returned 1:many mapping between keys and columns 2314 Filtering counts using DRIMSeq. 2315 Error in dmDSdata(counts = counts, samples = coldata) : 2316 mode(counts) %in% "numeric" is not TRUE 2317 Calls: dmDSdata -> stopifnot 2318 Execution halted 2319 2320 Work dir: 2321 /work/epi2me/work/7f/bbb4ae1a2db55fea2627f197d32994

NCBI Refseq .fa and .gtf 1 Reference MUTpoly2 WTpoly1 MUTpoly1 WTpoly2 2 MSTRG.263.1 188.0 107.0 151.0 112.0 3 MSTRG.289.1 159.0 96.0 175.0 121.0 4 MSTRG.457.1 152.0 94.0 130.0 119.0 5 MSTRG.511.1 148.0 167.0 157.0 198.0 6 MSTRG.32.1 140.0 115.0 135.0 106.0 7 MSTRG.768.1 127.0 81.0 133.0 81.0 8 MSTRG.61.1 113.0 111.0 96.0 100.0 9 MSTRG.198.1 112.0 39.0 82.0 38.0 10 MSTRG.515.1 109.0 74.0 107.0 72.0

Error 2206 ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)' 2207 2208 Caused by: 2209 Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1) 2210 2211 Command executed: 2212 2213 mkdir merged 2214 mkdir de_analysis 2215 mv de_transcript_counts.tsv merged/all_counts.tsv 2216 mv BarcodesPoly.csv de_analysis/coldata.tsv 2217 de_analysis.R annotation.gtf 3 1 10 3 gtf true 2218 2219 Command exit status: 2220 1 2221 2222 Command output: 2223 Loading counts, conditions and parameters. 2224 Loading annotation database. 2225 Filtering counts using DRIMSeq. 2226 2227 Command error: 2228 Loading counts, conditions and parameters. 2229 Warning message: 2230 In read.table(file = file, header = header, sep = sep, quote = quote, : 2231 incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv' 2232 Loading annotation database. 2233 Import genomic features from the file as a GRanges object ... OK 2234 Prepare the 'metadata' data frame ... OK 2235 Make the TxDb object ... OK 2236 'select()' returned 1:many mapping between keys and columns 2237 Filtering counts using DRIMSeq. 2238 Error in dmDSdata(counts = counts, samples = coldata) : 2239 mode(counts) %in% "numeric" is not TRUE 2240 Calls: dmDSdata -> stopifnot 2241 Execution halted 2242 2243 Work dir: 2244 /work/epi2me/work/01/c91d7314e0a6a9978ccec290d477f9

sarahjeeeze commented 7 months ago

Sorry, I feel like its nearly there! Does it retry that DE_analysis step? I would expect that error but it to retry that step 4 times and hopefully one of the retries will work, as we iterate through trying a few different gtf2/gff3 R options. Check you are on the latest version within the app (if you are in the app). Or try nextflow pull epi2me-labs/wf-transcriptomes if you are on cmd line to check you have the latest version.

cea295933 commented 7 months ago

I think so too … thanks to you!

It looks like it retries twice but fails at that moment:

2379 [27/4f170f] process > output (13) [ 69%] 16 of 23 2380 Retry deAnalysis with gff format setting and version removal. 2381 Retry deAnalysis with gtf format setting and version removal. 2382 [58/b882f6] NOTE: Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1) -- Execution is retried (2) 2383 [a4/ff1720] NOTE: Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1) -- Execution is retried (3) 2384 ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'

I tried the update but received the below error. Perhaps I need to ask my sysadmin to run it? Is it ok to run this on the head node or do we need to run it on the compute nodes?

@.***:/work/epi2me[07:47]$ ./nextflow pull epi2me-labs/wf-transcriptomes Checking epi2me-labs/wf-transcriptomes ... epi2me-labs/wf-transcriptomes contains uncommitted changes -- cannot pull from repository

Colin Echeverría Aitken

Assistant Professor Biology Department Biochemistry Program Vassar College @. @.> 845.437.7430

On Nov 21, 2023, at 4:17 AM, Sarah Griffiths @.***> wrote:

nextflow pull epi2me-labs/wf-transcriptomes

sarahjeeeze commented 7 months ago

No i'm sorry its still not working!

Ok on the files in your files.zip I did this to find the two which have strand set to . ,

grep -P '\t\.\t\.\t' annotation.gtf
BK006938.2      StringTie       transcript      1302119 1302626 1000    .       .       gene_id "MSTRG.844"; transcript_id "MSTRG.844.1"; 
BK006938.2      StringTie       exon    1302119 1302626 1000    .       .       gene_id "MSTRG.844"; transcript_id "MSTRG.844.1"; exon_number "1";

Which is causing this error values in 'transcripts$tx_strand' must be "+" or "-", If I run the downstream DE_analysis without these lines it seems to complete.

This is an error being created by stringtie I've not seen it before but will add a step to remove any non stranded annotations hopefully will release that by end of the week, in the meantime you could add these lines to line 92 of subworkflows/differential_expression.nf which might fix it

grep -v -P '\t\.\t\.\t' annotation.gtf > tmp.gtf
mv tmp.gtf annotation.gtf

cea295933 commented 7 months ago

Above and beyond! … where can I find the subworkflows directory? I tried find, but it’s taking a while ...

Colin Echeverría Aitken

Assistant Professor Biology Department Biochemistry Program Vassar College @. @.> 845.437.7430

On Nov 21, 2023, at 12:45 PM, Sarah Griffiths @.***> wrote:

No i'm sorry its still not working!

Ok on the files in your files.zip I did this to find the two which have strand set to . ,

grep -P '\t.\t.\t' annotation.gtf BK006938.2 StringTie transcript 1302119 1302626 1000 . . gene_id "MSTRG.844"; transcript_id "MSTRG.844.1"; BK006938.2 StringTie exon 1302119 1302626 1000 . . gene_id "MSTRG.844"; transcript_id "MSTRG.844.1"; exon_number "1"; Which is causing this error values in 'transcripts$tx_strand' must be "+" or "-", If I run the downstream DE_analysis without these lines it seems to complete.

This is an error being created by stringtie I've not seen it before but will add a step to remove any non stranded annotations hopefully will release that by end of the week, in the meantime you could add these lines to line 92 of subworkflows/differential_expression.nf which might fix it

grep -v -P '\t.\t.\t' annotation.gtf > tmp.gtf mv tmp.gtf annotation.gtf — Reply to this email directly, view it on GitHub https://github.com/epi2me-labs/wf-transcriptomes/issues/45#issuecomment-1821380550, or unsubscribe https://github.com/notifications/unsubscribe-auth/BD2SWUYDNC4OOW4SMMFYKVDYFTSEBAVCNFSM6AAAAAA7M2VUXKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMRRGM4DANJVGA. You are receiving this because you were mentioned.

cea295933 commented 7 months ago

ok .. I think I found it. Before line 92, right? I see the de_analysis.R script being called on line 92, and it appears to use the annotation.gtf as an input if I understand if correctly

cea295933 commented 7 months ago

ok, I changed differential_expression.nf as below:

""" 88 mkdir merged 89 mkdir de_analysis 90 mv $merged_tsv merged/all_counts.tsv 91 mv $sample_sheet de_analysis/coldata.tsv 92 93 grep -v -P '\t.\t.\t' annotation.gtf > tmp.gtf 94 mv tmp.gtf annotation.gtf 95 96 de_analysis.R annotation.gtf $params.min_samps_gene_expr $params.min_samps_feature_expr $params.min_gene_expr $params.min_feature_expr $annotation_type $strip_version

but now I get the below the moment I launch the workflow, so I think I made the edit incorrectly (sorry!)

ERROR ~ Module compilation error

file : /home/caitken/.nextflow/assets/epi2me-labs/wf-transcriptomes/./subworkflows/differential_expression.nf
cause: token recognition error at: ' de_analysis/coldata.tsv\n\t\n\tgrep -v -P '\t.' @ line 91, column 21. mv $sample_sheet de_analysis/coldata.tsv

sarahjeeeze commented 7 months ago

ok sorry forgot to escape the '\'

so should of been -

 grep -v -P '\\t\\.\\t\\.\\t' annotation.gtf > tmp.gtf
 mv tmp.gtf annotation.gtf

cea295933 commented 7 months ago

completed (with the downsampled data set)!!! Huzzah! I just launched jobs for the full data set to see if those work.

Thank you!!!

sarahjeeeze commented 7 months ago

ah okay yes, now the challenge is the full data set - let me know how it goes

cea295933 commented 7 months ago

completed! Thank you so much. One question: is there a meaningful difference in the pipeline with and without the minimap2 option we incorporated ("w 50")? Trying to determine whether it's worth testing it without that flag or not, now that everything works.

sarahjeeeze commented 7 months ago

Good point! It may result in a few more unaligned reads, so you may be missing some transcripts. Now its working it might be worth trying with the standard "w 10" to see if it still works, or "w 25".

sarahjeeeze commented 6 months ago

Closing through lack of response and issues solves

cea295933 commented 6 months ago

My apologies for not closing the loop here. Yes, this is now resolved. Running smoothly on the full dataset and with either set of minimap2 options. Thank you for all of your help!

Colin Echeverría Aitken

Assistant Professor Biology Department Biochemistry Program Vassar College @. @.> 845.437.7430

On Dec 12, 2023, at 5:57 AM, Sarah Griffiths @.***> wrote:

Closing through lack of response and issues solves

— Reply to this email directly, view it on GitHub https://github.com/epi2me-labs/wf-transcriptomes/issues/45#issuecomment-1851808378, or unsubscribe https://github.com/notifications/unsubscribe-auth/BD2SWU4W22IZMKFLKI7GWKLYJA2DNAVCNFSM6AAAAAA7M2VUXKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNJRHAYDQMZXHA. You are receiving this because you were mentioned.

epi2me-labs / wf-transcriptomes

samtools error ([main_samview] fail to read the header from "-") during DE analysis ... empty bam file? #45

Ask away!