Magdoll / cDNA_Cupcake

Miscellaneous collection of Python and R scripts for processing Iso-Seq data
BSD 3-Clause Clear License
257 stars 102 forks source link

chain_samples inflated number of isoforms #163

Open rmenafra opened 3 years ago

rmenafra commented 3 years ago

Hi Liz, I'm using chain_samples.py on 6 human samples. The 6 samples were multiplexed and run over 6 SMRTcells and samples were combined after the refine step. After the collapse by isoform step the 6 samples had the following numbers of isoforms: 90465 134305 74215 84366 118929 109500 I used chain_samples with default parameters (in a clean folder) and obtained 19M isoforms. I was wondering if you have any idea of what could have happened. Thanks Roberta

Magdoll commented 3 years ago

There's expected to be some inflation but 19M doesn't make any sense..not even if you add all of them up.

Did you take a look at the all_samples.chained_ids.txt? What does it look like? Can you show the first 20 lines? How many lines are in that file?

-Liz

rmenafra commented 3 years ago

all_samples.chained_ids.txt has 19766014 lines. The first 20 are below, I know that 19M makes no sense.

superPBID PC25 PC45 PC54 PC62 PC64 PC73 PB.1.1 NA PB.1.1 NA NA NA NA PB.2.1 NA NA NA NA PB.1.1 NA PB.2.2 NA NA NA NA PB.1.2 NA PB.3.1 PB.1.1 NA NA NA NA NA PB.3.2 NA NA PB.1.1 NA NA NA PB.3.3 NA NA NA NA PB.2.1 NA PB.4.1 NA PB.2.1 NA NA NA NA PB.4.2 PB.2.1 PB.2.2 NA NA PB.3.4 NA PB.4.3 NA PB.2.3 NA NA PB.3.1 NA PB.4.4 NA NA PB.2.3 PB.1.1 PB.3.2 NA PB.4.5 NA NA NA NA PB.3.3 NA PB.4.6 PB.2.2 NA NA NA PB.3.5 NA PB.4.7 NA NA PB.2.1 NA NA NA PB.4.8 NA NA NA PB.1.2 NA NA PB.4.9 PB.2.3 NA NA NA NA NA PB.4.10 NA NA PB.2.5 NA NA NA PB.4.11 PB.2.4 NA NA NA NA NA PB.4.12 PB.2.5 PB.2.5 NA PB.1.4 NA NA PB.4.13 PB.2.6 PB.2.4 PB.2.6 PB.1.3 PB.3.6 NA

Roberta

osvaldoreisss commented 2 years ago

I'm having the same problem.

config file:

SAMPLE=Pool1_1;collapsed/Pool1_Suzy_1
SAMPLE=Pool2_2;collapsed/Pool2_Atrop_2
SAMPLE=Pool2_3;collapsed/Pool2_Bob_3
SAMPLE=Pool2_4;collapsed/Pool2_Tom_4

GROUP_FILENAME=sample.collapse_isoforms.collapsed.group.txt
GFF_FILENAME=sample.collapse_isoforms.collapsed.gff
COUNT_FILENAME=sample.collapse_isoforms.collapsed.abundance.txt

The initial number of transcripts after collapse is:

./Pool1_1/sample.collapse_isoforms.collapsed.gff
32441
./Pool2_3/sample.collapse_isoforms.collapsed.gff
39427
./Pool2_2/sample.collapse_isoforms.collapsed.gff
49306
./Pool2_4/sample.collapse_isoforms.collapsed.gff
12720

I installed cdna_cupcake v28.0.0 with conda and the command line is:

chain_samples.py config count_fl

The final number of transcripts in all_samples.chained.gff is 2,189,160. Inspecting the GFF I saw many different transcripts ID with the same coordinates:

NC_051805.1 PacBio  transcript  23876887    23917149    .   -   .   transcript_id "PB.489.1410"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23876887    23877711    .   -   .   transcript_id "PB.489.1410"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23881774    23881872    .   -   .   transcript_id "PB.489.1410"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23882552    23882622    .   -   .   transcript_id "PB.489.1410"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23889795    23889897    .   -   .   transcript_id "PB.489.1410"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23895117    23895259    .   -   .   transcript_id "PB.489.1410"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23897711    23897825    .   -   .   transcript_id "PB.489.1410"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23900552    23900665    .   -   .   transcript_id "PB.489.1410"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23900956    23901053    .   -   .   transcript_id "PB.489.1410"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23901373    23901482    .   -   .   transcript_id "PB.489.1410"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23903684    23903787    .   -   .   transcript_id "PB.489.1410"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23904276    23904437    .   -   .   transcript_id "PB.489.1410"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23905784    23905859    .   -   .   transcript_id "PB.489.1410"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23907966    23908115    .   -   .   transcript_id "PB.489.1410"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23910297    23910430    .   -   .   transcript_id "PB.489.1410"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23917054    23917149    .   -   .   transcript_id "PB.489.1410"; gene_id "PB.489";
NC_051805.1 PacBio  transcript  23876887    23917149    .   -   .   transcript_id "PB.489.1411"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23876887    23877711    .   -   .   transcript_id "PB.489.1411"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23881774    23881872    .   -   .   transcript_id "PB.489.1411"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23882552    23882622    .   -   .   transcript_id "PB.489.1411"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23889795    23889897    .   -   .   transcript_id "PB.489.1411"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23895117    23895259    .   -   .   transcript_id "PB.489.1411"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23897711    23897825    .   -   .   transcript_id "PB.489.1411"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23900552    23900665    .   -   .   transcript_id "PB.489.1411"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23900956    23901053    .   -   .   transcript_id "PB.489.1411"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23901373    23901482    .   -   .   transcript_id "PB.489.1411"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23903684    23903787    .   -   .   transcript_id "PB.489.1411"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23904276    23904437    .   -   .   transcript_id "PB.489.1411"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23905784    23905859    .   -   .   transcript_id "PB.489.1411"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23907966    23908115    .   -   .   transcript_id "PB.489.1411"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23910297    23910430    .   -   .   transcript_id "PB.489.1411"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23917054    23917149    .   -   .   transcript_id "PB.489.1411"; gene_id "PB.489";
NC_051805.1 PacBio  transcript  23876887    23917149    .   -   .   transcript_id "PB.489.1412"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23876887    23877711    .   -   .   transcript_id "PB.489.1412"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23881774    23881872    .   -   .   transcript_id "PB.489.1412"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23882552    23882622    .   -   .   transcript_id "PB.489.1412"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23889795    23889897    .   -   .   transcript_id "PB.489.1412"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23895117    23895259    .   -   .   transcript_id "PB.489.1412"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23897711    23897825    .   -   .   transcript_id "PB.489.1412"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23900552    23900665    .   -   .   transcript_id "PB.489.1412"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23900956    23901053    .   -   .   transcript_id "PB.489.1412"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23901373    23901482    .   -   .   transcript_id "PB.489.1412"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23903684    23903787    .   -   .   transcript_id "PB.489.1412"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23904276    23904437    .   -   .   transcript_id "PB.489.1412"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23905784    23905859    .   -   .   transcript_id "PB.489.1412"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23907966    23908115    .   -   .   transcript_id "PB.489.1412"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23910297    23910430    .   -   .   transcript_id "PB.489.1412"; gene_id "PB.489";
NC_051805.1 PacBio  exon    23917054    23917149    .   -   .   transcript_id "PB.489.1412"; gene_id "PB.489";

This specific gene PB.489 has 1,413 transcripts. but if I count only those with start and end equals and uniq I get 60 transcripts for this gene:

grep "\"PB.489\"" all_samples.chained.gff | awk '{print $1$2$3$4$5}' | sort | uniq | wc -l

I know that counting this way I could be losing some transcripts that have that same start and end because of the different exons usage, but as you can see in the example above, for the most part, shares identical exon usage.

Do you have some clue of how can I fix it?