isoseq refine - Githubissues

marcus-tutert commented 6 months ago

Hi,

I am running the isoseq pipeline documented here: https://isoseq.how/umi/cli-workflow.html

Using the single-cell iso-seq workflow.

I start my pipeline with hifi reads generated from 10x 3' chemistry and have run the first few steps: primer removal, tag removal and now am refining the reads

isoseq refine ${tag_removed_bam_path} /lustre/scratch126/humgen/projects/isogut/qc/primers.fasta \$sample_name.fltnc.bam --require-polya

For some reason this is eliminating nearly all my reads across all my samples -- I am left with just a handful of reads (for context my bam file size pre and post processing goes from 30GB to 8Kb)

Is there something I am doing wrong here?

Thanks!

armintoepfer commented 6 months ago

Does you data have polyA tails? Otherwise, please upload a tiny example BAM with reads that get removed in refine.

marcus-tutert commented 6 months ago

Hi, thank you for your very quick response!

I'm not sure I'll be able to upload a bam due to data governance/restrictions but let me see what I can do.

However, looking at my reads in the flt.bam I can see evidence of polyAtails.

I tried refine with and without the polyatail setting and the resulting reads didn't change all that much (still lost nearly all my data)

armintoepfer commented 6 months ago

Please paste the footer of the log output if you run it with --log-level INFO and also check the reports

marcus-tutert commented 6 months ago

|> 20240321 19:54:37.853 -|- INFO -|- ParsePositionalArgs -|- 0x7fb0766860c0|| -|- Input Barcode file: /lustre/scratch126/humgen/projects/isogut/qc/primers.fasta |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Primer prefixes used to detect concatemers: |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- AAGCAGTGGTATCAACGCAGAGTACATGGG |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- AGATCGGAAGAGCGTCGTGTAG |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- CCCATGTACTCTGCGTTGATACCACTGCTT |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- CTACACGACGCTCTTCCGATCT |> 20240321 19:54:37.855 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output FLNC bam: test.fltnc.bam |> 20240321 19:54:37.855 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output summary json: test.fltnc.filter_summary.report.json

It is running now but here is what I see so far

marcus-tutert commented 6 months ago

Here is the contents of my fasta file too (comes from the 10x 3' chemistry that is available on the isoseq website)

5p AAGCAGTGGTATCAACGCAGAGTACATGGG 3p AGATCGGAAGAGCGTCGTGTAG

marcus-tutert commented 6 months ago

|> 20240321 19:54:37.853 -|- INFO -|- ParsePositionalArgs -|- 0x7fb0766860c0|| -|- Input Barcode file: /lustre/scratch126/humgen/projects/isogut/qc/primers.fasta |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Primer prefixes used to detect concatemers: |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- AAGCAGTGGTATCAACGCAGAGTACATGGG |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- AGATCGGAAGAGCGTCGTGTAG |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- CCCATGTACTCTGCGTTGATACCACTGCTT |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- CTACACGACGCTCTTCCGATCT |> 20240321 19:54:37.855 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output FLNC bam: test.fltnc.bam |> 20240321 19:54:37.855 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output summary json: test.fltnc.filter_summary.report.json

|> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Input reads : 7264302 |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output reads : 19 |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output bases : 24777 |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Filtered RQ reads : 0 |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Missing RQ reads : 0 |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Run Time : 11m 944us |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- CPU Time : 9h 52m |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Peak RSS : 0.88908 GB

Here is my full log file

marcus-tutert commented 6 months ago

And my reports: id,strand,fivelen,threelen,polyAlen,insertlen,primer m84093_240202_144742_s1/263197381/ccs,+,0,8,25,477,5p--3p m84093_240202_144742_s1/219156810/ccs,+,8,0,28,1451,5p--3p m84093_240202_144742_s1/204345334/ccs,+,8,0,25,1561,5p--3p m84093_240202_144742_s1/123997463/ccs,+,0,8,31,1619,5p--3p m84093_240202_144742_s1/112922188/ccs,+,0,8,41,375,5p--3p m84093_240202_144742_s1/109251907/ccs,+,0,9,31,2371,5p--3p m84093_240202_144742_s1/91559779/ccs,+,0,8,27,786,5p--3p m84093_240202_144742_s1/92542161/ccs,+,0,8,26,1685,5p--3p m84093_240202_144742_s1/75962115/ccs,+,8,0,26,1060,5p--3p m84093_240202_144742_s1/87888614/ccs,+,7,8,102,532,5p--3p m84093_240202_144742_s1/46536294/ccs,+,8,0,28,1906,5p--3p m84093_240202_144742_s1/25367936/ccs,+,8,0,29,1614,5p--3p m84093_240202_144742_s1/34669647/ccs,+,0,8,60,587,5p--3p m84093_240202_144742_s1/35783133/ccs,+,0,8,29,1043,5p--3p m84093_240202_144742_s1/65668734/ccs,+,8,0,31,1578,5p--3p m84093_240202_144742_s1/118557153/ccs,+,8,0,74,1614,5p--3p m84093_240202_144742_s1/119408775/ccs,+,0,8,27,847,5p--3p m84093_240202_144742_s1/111478958/ccs,+,8,0,22,1484,5p--3p m84093_240202_144742_s1/174523972/ccs,+,8,0,45,2187,5p--3p

{ "_comment": "Created by pbcopper v2.3.99", "attributes": [ { "id": "sample_name", "name": "Sample Name", "value": "Isogut14548275" }, { "id": "num_reads_fl", "name": "Full-Length Reads", "value": 7264302 }, { "id": "num_reads_flnc", "name": "Full-Length Non-Chimeric Reads", "value": 22 }, { "id": "num_reads_flnc_polya", "name": "Full-Length Non-Chimeric Reads with Poly-A Tail", "value": 19 } ], "dataset_uuids": [], "id": "isoseq_refine", "plotGroups": [], "tables": [], "title": "Iso-Seq Refine Report", "uuid": "c745eac5-a13f-4fdb-bd0f-2445e78ecb6b", "version": "1.0.1" }

Magdoll commented 6 months ago

Hi @GelatinousGiant - Can you please give a bit more context? Is this using the MAS-Seq for 10x Single Cell 3' kit? Is this 10x 3' cDNA? Did you first run skera to get the segmented reads (S-reads) and use that as input to lima and isoseq refine?

armintoepfer commented 3 months ago

Closing due to lack of input

TheCatalyticMicroChip commented 1 month ago

Hi @Magdoll and @armintoepfer,

I am new to PacBio long read sequencing. I received hifi_reads.fastq, and hifi_reads.bam files which I assume are the unaligned CCS reads, output of CCS tool. I believe our run was on MAS-Seq for 10x Single Cell 3' kit and the reads are the 10X cDNAs. I am facing the similar issue with @GelatinousGiant. I started off with 23G flt.bam ending up with 170K fltnc.bam. My CL:

$ isoseq refine m64410e.flt.bam primers.fasta m64410e.fltnc.bam --require-polya

I see that my flt.bam has about ~12K bases long reads with polyA tracks. Should I have ran skera split first to segment the reads? I went straight to lima with the hifi_reads.bam.

Thanks.

PacificBiosciences / pbbioconda

isoseq refine #667