BrooksLabUCSC / flair

Full-Length Alternative Isoform analysis of RNA

Gene model naming issue #323

Open sagnikbanerjee15 opened 4 months ago

sagnikbanerjee15 commented 4 months ago

I have read the paper (https://doi.org/10.1038/s41467-020-15171-6)
and the manual (https://flair.readthedocs.io/en/latest/) and I still have a question about

I found several gene models for which different transcripts are present on different strands. In fact, there are about 35 of these cases.

1   FLAIR   transcript  31365197    31365505    .   +   .   gene_id "1:31365000"; transcript_id "e00705c2-1531-4e86-8367-c64e2c9b1d93";
1   FLAIR   exon    31365197    31365505    .   +   .   gene_id "1:31365000"; transcript_id "e00705c2-1531-4e86-8367-c64e2c9b1d93"; exon_number "0";
1   FLAIR   transcript  31365460    31365795    .   -   .   gene_id "1:31365000"; transcript_id "e4703547-21b4-46a5-9b19-51e737f9fa2c";
1   FLAIR   exon    31365460    31365795    .   -   .   gene_id "1:31365000"; transcript_id "e4703547-21b4-46a5-9b19-51e737f9fa2c"; exon_number "0";

11  FLAIR   transcript  10509258    10509450    .   -   .   gene_id "11:10509000"; transcript_id "db3b1fe9-88b3-4a29-a50c-ccef9e31d1cc";
11  FLAIR   exon    10509258    10509450    .   -   .   gene_id "11:10509000"; transcript_id "db3b1fe9-88b3-4a29-a50c-ccef9e31d1cc"; exon_number "0";
11  FLAIR   transcript  10509654    10509795    .   -   .   gene_id "11:10509000"; transcript_id "5edd4968-2d1e-46be-89d6-ba7b4d561a31";
11  FLAIR   exon    10509654    10509795    .   -   .   gene_id "11:10509000"; transcript_id "5edd4968-2d1e-46be-89d6-ba7b4d561a31"; exon_number "0";
11  FLAIR   transcript  10509825    10512775    .   +   .   gene_id "11:10509000"; transcript_id "0777a816-cd00-40c1-a34c-579bb4ebbbde";
11  FLAIR   exon    10509825    10512775    .   +   .   gene_id "11:10509000"; transcript_id "0777a816-cd00-40c1-a34c-579bb4ebbbde"; exon_number "0";

Additionally, I found some cases where the start coordinate is greater than the end coordinate:

1       FLAIR   exon    160222651       160222641       .       -       .       gene_id "ENSG00000162729"; transcript_id "bb796a6e-d129-4d50-a94e-dd0d7ff6f1d5"; exon_number "3";
19      FLAIR   exon    12711296        12711294        .       -       .       gene_id "19:12643000"; transcript_id "3176165f-f380-43ee-8b52-789e9cefbe7b"; exon_number "1";
19      FLAIR   exon    41892331        41892323        .       +       .       gene_id "ENSG00000105372"; transcript_id "c5de63d4-73cd-4fed-86d9-21f7fb217ee0"; exon_number "4";
19      FLAIR   exon    54142926        54142923        .       +       .       gene_id "ENSG00000170906"; transcript_id "53ce5865-a4eb-4fe0-8640-d24b9edd1052"; exon_number "4";
9       FLAIR   exon    137245639       137245635       .       -       .       gene_id "9:137220000"; transcript_id "2e0b6749-e106-4d2b-a420-f341fba59fe4"; exon_number "1";
16      FLAIR   exon    46711587        46711583        .       -       .       gene_id "ENSG00000069329"; transcript_id "e20859d6-93ef-446e-931b-a8bda2708543"; exon_number "15";
4       FLAIR   exon    108651031       108651029       .       +       .       gene_id "ENSG00000109475"; transcript_id "a6cf1cef-eb9a-4f35-9b8e-20dbe60a2dc9"; exon_number "4";

Could you please look into these issues?

Thanks

Jeltje commented 4 months ago

Opposite strand: It's not easy to infer strand from unspliced reads, which is why single-exon isoforms may end up on arbitrary strands. FLAIR isn't really meant for finding novel single-exon genes, and it's generally a good idea to run with the --annotation_reliant setting.
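To confirm that the mixed-strand genes consist only of single-exon isoforms, a check along these lines could be run on the collapse output. This is a minimal sketch and not part of FLAIR: the function name mixed_strand_genes is made up for illustration, and it assumes a GTF whose attribute column carries gene_id and transcript_id in the same form as the lines quoted above.

```python
import re
from collections import defaultdict

GENE_RE = re.compile(r'gene_id "([^"]+)"')
TX_RE = re.compile(r'transcript_id "([^"]+)"')


def mixed_strand_genes(gtf_lines):
    """Return {gene_id: {transcript_id: (strand, exon_count)}} for every
    gene whose transcripts are not all on the same strand.

    Hypothetical helper: only 'exon' records are counted, and the
    attribute format is assumed to match the FLAIR output shown above.
    """
    genes = defaultdict(dict)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Split on any whitespace so that pasted, space-separated
        # records work as well as tab-separated ones; the ninth field
        # keeps the whole attribute string.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "exon":
            continue
        gene = GENE_RE.search(fields[8])
        tx = TX_RE.search(fields[8])
        if not gene or not tx:
            continue
        strand = fields[6]
        _, count = genes[gene.group(1)].get(tx.group(1), (strand, 0))
        genes[gene.group(1)][tx.group(1)] = (strand, count + 1)
    # Keep only genes that have more than one distinct strand.
    return {
        gene_id: txs
        for gene_id, txs in genes.items()
        if len({strand for strand, _ in txs.values()}) > 1
    }
```

If every transcript reported this way has an exon count of 1, the cases are consistent with the unspliced-read explanation above.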

Coordinates: This is very worrisome. Sadly, I don't see it happen in any of my test outputs, so I can't fix it yet. Do you see this problem in your annotation GTF (awk '$5 < $4' annot.gtf) or in the corrected BED file (awk '$3 < $2' corrected.bed) that you pass to collapse?
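For anyone who prefers to script this, the two awk checks can be approximated in Python as below. This is only a sketch under stated assumptions: inverted_records is a hypothetical helper name, columns are given as 0-based indexes, and records are split on whitespace the same way awk does by default.

```python
def inverted_records(lines, start_col, end_col):
    """Yield (line_number, record) for records whose end coordinate is
    smaller than the start coordinate.

    start_col and end_col are 0-based column indexes: 3 and 4 for GTF,
    1 and 2 for BED. Hypothetical helper, not part of FLAIR.
    """
    for number, line in enumerate(lines, start=1):
        # Skip blank lines, comments, and BED track/browser headers.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        try:
            start = int(fields[start_col])
            end = int(fields[end_col])
        except (IndexError, ValueError):
            # Skip malformed records rather than guess at their meaning.
            continue
        if end < start:
            yield number, line.rstrip("\n")
```

Passing an open file handle for annot.gtf with start_col=3 and end_col=4 corresponds to the GTF check, and corrected.bed with start_col=1 and end_col=2 corresponds to the BED check. The reported line numbers make it easier to find which input first contains an inverted record.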