Magdoll / cDNA_Cupcake

Miscellaneous collection of Python and R scripts for processing Iso-Seq data
BSD 3-Clause Clear License
257 stars 103 forks source link

demux_isoseq_with_genome.py Issue - expecting input not generated by get_abundance_post_collapse.py #223

Open KrRi4 opened 2 years ago

KrRi4 commented 2 years ago

Hi,

I was hoping someone could help me solve/clarify the issue I'm having below.

First, I am running cDNA_Cupcake v28.0.0, taken from README:

cDNA_Cupcake

Last Updated: 09.15.2021 Current version: 28.0.0

I have 2 IsoSeq CCS files that I ran through steps 1-7 outlined here: https://github.com/Magdoll/cDNA_Cupcake/issues/167. I would like to obtain the sample specific counts which I believe can be generated using the demux_isoseq_with_genome.py script.

In order to obtain the read stat file for the above script, I ran get_abundance_post_collapse.py.

Here are the first few lines of the required input files for demux_isoseq_with_genome.py script:

_collapsed.mapped.readstat.txt id is_fl stat pbid 1_m54328U_191216_220556/90243745/ccs Y unique PB.7953.2 1_m54328U_191216_220556/64751804/ccs Y unique PB.7953.2 1_m54328U_191216_220556/83690194/ccs Y unique PB.7953.2 1_m54328U_191216_220556/53478253/ccs Y unique PB.5007.3 2_m54328U_191216_220556/115672586/ccs Y unique PB.642.8

_all_reportflnc.csv id,strand,fivelen,threelen,polyAlen,insertlen,primer 1_m54328U_191216_220556/4/ccs,+,22,24,33,4386,1_NEB_5p--1_NEB_Clontech_3p 1_m54328U_191216_220556/11/ccs,+,21,24,30,5525,1_NEB_5p--1_NEB_Clontech_3p 1_m54328U_191216_220556/12/ccs,+,22,25,32,4805,1_NEB_5p--1_NEB_Clontech_3p 1_m54328U_191216_220556/14/ccs,+,25,22,30,7448,1_NEB_5p--1_NEB_Clontech_3p 2_m54328U_191216_220556/180554526/ccs,+,22,25,29,3578,2_NEB_5p--2_NEB_Clontech_3p

_collapsedmapped.collapsed.rep.fq @PB.1.1|chr1:14401-29364(-)|transcript/77472 transcript/77472 GAGAGCGGAAGC.... @PB.1.2|chr1:14401-29359(-)|transcript/121124 transcript/121124 GGAAGCGG....

I get the following warning when I run the script... and the resulting output file is empty.

demux_isoseq_with_genome.py --mapped_fafq collapsed_mapped.collapsed.rep.fq --read_stat collapsed_mapped.collapsed.read_stat.txt --classify_csv all_report_flnc.csv -o test Reading all_report_flnc.csv.... Reading collapsed_mapped.collapsed.read_stat.txt.... Reading collapsed_mapped.collapsed.rep.fq.... WARNING: Could not find PB.1.1|chr1:14401-29363(-)|transcript/77472 in collapsed_mapped.collapsed.read_stat.txt. No demux output for this sequence! WARNING: Could not find PB.1.2|chr1:14401-29355(-)|transcript/121124 in collapsed_mapped.collapsed.read_stat.txt. No demux output for this sequence!

It appears the script is looking for the entire P.B.x.y.|chr... string in the read.stat file.
Now, if I remove everything after the P.B.x.y with sed 's/|[a-z].*//g' collapsed_mapped.collapsed.rep.fq > testingSyntax.fq and use testingSnytax.fq for the --mapped file in demux_isoseq_with_genome.py it works!

I wanted to make sure I haven't done something off with this process up until this point, or if this is an issue with the script? So any clarification would be helpful.

Thank you! Kr