Magdoll / cDNA_Cupcake

Miscellaneous collection of Python and R scripts for processing Iso-Seq data
BSD 3-Clause Clear License
257 stars 104 forks source link

Question about outputs for collapse_isoforms_by_sam.py & get_abundance_post_collapse.py #95

Closed WallyL closed 5 years ago

WallyL commented 5 years ago

Hi Liz,

I'm running a set of human data that has been processed per the latest IsoSeq3 scripts. The Cupcake input file is HQ isoforms generated after isoseq3 polish. Minimap2 was run vanilla:

minimap2 -ax splice -t 8 -uf --secondary=no -C5 Homo_sapiens.GRCh38.dna.toplevel.fa \
AC_polished.hq.fastq > AC_hq.sam

So after I ran the Cupcake pipeline (it ran fine, w/no errors), I tried rarefaction and it failed when run with --by refgene and --by refisoform flags, i.e. no output. (--by pbgene seemed to work). So I went back and looked at some of the collapse and abundance report outputs and found the following and wanted check to see if these look okay. Here are the headers of the files:

1) AC_hq.collapsed.group.txt output from _collapse_isoforms_bysam.py probably fine, just doesn't look anything like the output in your Cupcake example.

PB.1.1  transcript/23910
PB.2.1  transcript/9327
PB.2.2  transcript/9980
PB.2.3  transcript/9895
PB.2.4  transcript/16058
PB.3.1  transcript/28571
PB.4.1  transcript/29272
PB.4.2  transcript/28803,transcript/30061
PB.5.1  transcript/307
PB.5.2  transcript/397

2) AC_hq.collapsed.read_stat.txt output from _get_abundance_postcollapse.py- the length field is all NA. I see the mmap2 out has

WARNING: isoseq3 format detected. Output length column will be NA

but can be resolved after the fact with _get_seqstats.py Curious if you know if there's a mmap2 flag that will fix this and/or do I need to include the lengths prior to running later steps?

id  length  is_fl   stat    pbid
m54193_190813_205800/27066957/ccs   NA  Y   unique  PB.9184.3
m54193_190813_205800/25887454/ccs   NA  Y   unique  PB.9184.3
m54193_190813_205800/73859721/ccs   NA  Y   unique  PB.9184.3

3) AC_hq.collapsed.abundance.txt output from _get_abundance_postcollapse.py The fl and nfl fields are all the same...

head
pbid    count_fl    count_nfl   count_nfl_amb   norm_fl norm_nfl    norm_nfl_amb
PB.2.1  4   4   4   1.7230e-05  1.7230e-05  1.7230e-05
PB.2.2  42  42  42  1.8091e-04  1.8091e-04  1.8091e-04
PB.2.3  28  28  28  1.2061e-04  1.2061e-04  1.2061e-04
PB.2.4  2   2   2   8.6148e-06  8.6148e-06  8.6148e-06
PB.3.1  3   3   3   1.2922e-05  1.2922e-05  1.2922e-05
PB.4.1  229 229 229 9.8639e-04  9.8639e-04  9.8639e-04

tail
PB.9199.1   3   3   3   1.2922e-05  1.2922e-05  1.2922e-05
PB.9199.2   3   3   3   1.2922e-05  1.2922e-05  1.2922e-05
PB.9199.3   44  44  44  1.8953e-04  1.8953e-04  1.8953e-04
PB.9199.4   7   7   7   3.0152e-05  3.0152e-05  3.0152e-05

Should I be concerned about this last output? I'll post later regarding rarefaction, if I can't get it working, but wanted to ask about this file first, since it looks like it's an input for _make_file_for_subsampling_fromcollapsed.py

Thanks, Walt

Magdoll commented 5 years ago

Hi @WallyL ,

--by refgene and --by refisoform will only work if you provide a SQANTI2 classification when you ran make_file_for_subsampling_from_collapsed.py using the -m2 option. Otherwise, there is no reference gene or transcript info. This is most obvious by looking at your input to the rarefaction (which is the output of make_file_for_subsampling_from_collapsed.py), the refgene and refisoform columns must be populated for subsample.py to work with them.

--LIz

WallyL commented 5 years ago

Thanks, Liz. I saw SQUANT2 section, but didn't catch that it was required.

So the identical count_fl count_nfl values in the last abundance output is nothing to be concerned about?

Walt

Huangyizhong commented 5 years ago

Hi Liz,

I'm running a set of human data that has been processed per the latest IsoSeq3 scripts. The Cupcake input file is HQ isoforms generated after isoseq3 polish. Minimap2 was run vanilla:

minimap2 -ax splice -t 8 -uf --secondary=no -C5 Homo_sapiens.GRCh38.dna.toplevel.fa \
AC_polished.hq.fastq > AC_hq.sam

So after I ran the Cupcake pipeline (it ran fine, w/no errors), I tried rarefaction and it failed when run with --by refgene and --by refisoform flags, i.e. no output. (--by pbgene seemed to work). So I went back and looked at some of the collapse and abundance report outputs and found the following and wanted check to see if these look okay. Here are the headers of the files:

  1. AC_hq.collapsed.group.txt output from _collapse_isoforms_bysam.py probably fine, just doesn't look anything like the output in your Cupcake example.
PB.1.1    transcript/23910
PB.2.1    transcript/9327
PB.2.2    transcript/9980
PB.2.3    transcript/9895
PB.2.4    transcript/16058
PB.3.1    transcript/28571
PB.4.1    transcript/29272
PB.4.2    transcript/28803,transcript/30061
PB.5.1    transcript/307
PB.5.2    transcript/397
  1. AC_hq.collapsed.read_stat.txt output from _get_abundance_postcollapse.py- the length field is all NA. I see the mmap2 out has

WARNING: isoseq3 format detected. Output length column will be NA

but can be resolved after the fact with _get_seqstats.py Curious if you know if there's a mmap2 flag that will fix this and/or do I need to include the lengths prior to running later steps?

id    length  is_fl   stat    pbid
m54193_190813_205800/27066957/ccs NA  Y   unique  PB.9184.3
m54193_190813_205800/25887454/ccs NA  Y   unique  PB.9184.3
m54193_190813_205800/73859721/ccs NA  Y   unique  PB.9184.3
  1. AC_hq.collapsed.abundance.txt output from _get_abundance_postcollapse.py The fl and nfl fields are all the same...
head
pbid  count_fl    count_nfl   count_nfl_amb   norm_fl norm_nfl    norm_nfl_amb
PB.2.1    4   4   4   1.7230e-05  1.7230e-05  1.7230e-05
PB.2.2    42  42  42  1.8091e-04  1.8091e-04  1.8091e-04
PB.2.3    28  28  28  1.2061e-04  1.2061e-04  1.2061e-04
PB.2.4    2   2   2   8.6148e-06  8.6148e-06  8.6148e-06
PB.3.1    3   3   3   1.2922e-05  1.2922e-05  1.2922e-05
PB.4.1    229 229 229 9.8639e-04  9.8639e-04  9.8639e-04

tail
PB.9199.1 3   3   3   1.2922e-05  1.2922e-05  1.2922e-05
PB.9199.2 3   3   3   1.2922e-05  1.2922e-05  1.2922e-05
PB.9199.3 44  44  44  1.8953e-04  1.8953e-04  1.8953e-04
PB.9199.4 7   7   7   3.0152e-05  3.0152e-05  3.0152e-05

Should I be concerned about this last output? I'll post later regarding rarefaction, if I can't get it working, but wanted to ask about this file first, since it looks like it's an input for _make_file_for_subsampling_fromcollapsed.py

Thanks, Walt

Hi. Have you solved this problem? I also confused about it and could not get the information about the refgene and refisoform columns Thanks Toney

Magdoll commented 5 years ago

Hi @WallyL ,

Yes it is expected count_fl and count_nfl are now the same. We no longer use nFL in IsoSeq3 so only the count_fl matters.

-Liz