aljpetri / isONform

De novo construction of isoforms from long-read data
GNU General Public License v3.0

Output needs to be reformatted #15

Open alexyfyf opened 7 months ago

alexyfyf commented 7 months ago

Hi team,

I found that your isonform output fasta file is not in the standard format with a > line as header. There are also lots of empty files in the isonform folder, such as:

(base) [yan.a@vc7-shared isoforms]$ ll cluster26150*
-rw-r--r-- 1 yan.a allstaff 0 Dec  2 07:59 cluster26150_mapping_low_abundance.txt
-rw-r--r-- 1 yan.a allstaff 0 Dec  2 07:59 cluster26150_mapping.txt
-rw-r--r-- 1 yan.a allstaff 0 Dec  2 07:59 cluster26150_merged.fa
-rw-r--r-- 1 yan.a allstaff 0 Dec  2 07:59 cluster26150_merged_low_abundance.fa

Also, can you explain what the numbers in the header line mean? For example, this one:

@0_105_891
ACUUCGACCAAGAAGAGAUACGGUGCUCUCGCCGGUAACGUCGGUGACGAAGGUGGUGUUGCUCCAAACAUUCAAACCGCUGAAGAAGCUUUGGACUUGAUUGUUGACGCUAUCAAGGCUGCUGGUCACGACGGUAAGGUCAAGAUCGGUUUGGACUGUGCUUCCUCUGAAUUUUCAAGGACGGUAAGUACGACUUGGACUUCAAGAACCCAGAAUCUGACAAAUCCAAGUGGUUGACUGGUGUCGAAUUGGCUGACAUGUACCACUCCUUGAUGAAGAGAUACCCAAUUGUCUCCAUCGAAGAUCCAUUUGCUGAAGAUGACUGGGAAGCUUGGUCUUCACUUCAAGACCGCUGGUAUCCAAAUUGUUGCUGAUGAUUUGACUGUCACCAACCCAGCUAGAAUUGCUACCGCCAUCGAAAAGAAGGCUGCUGACGCUUUGUUGUUGAAGGUUAACCAAAUCGGUACCUUGUCUGAAUCCAUCAAGGCUGCUCAAGACUUUCCUGCCAACUGGGUGUCAUGGUUUCCCACAGAUCUGGUGAAACUGAAGACACUUCAUUGCUGACUUGGUUGUCGGUUUGAGAACUGGUCAAAUCAAGACUGGUGCUCCAGCUAGAUCCGAAAGAUUGGCUAAGUUGAACCAAUUGUUGAGAAUCGAAGAAGAAUUGGGUGACAAGGCUGUCUACGCCGGUGAAAACUUCCACCACGGUGACAAGUUGUAUCGUCGUGAGUAGUGAACCGUAAGCAAAAAAAUUCCCUCAACCAUCUUAUAUCCAUUCAACCUACCAUUCCUCAAUCA

Thank you so much.

Alex

aljpetri commented 7 months ago

Hi, thank you for reporting this error. I have pushed a new release that should fix the fasta output format. The idea behind the header for each isoform is as follows: the first number (in your case '0') denotes the cluster the isoform was generated from. The second number (in your case '105') gives the batch number within the cluster (we divide each cluster into batches of 1000 reads each), while the third number is an individual id, so that no two isoforms end up with the same identifier. I will address the problem with the empty intermediate files in the next few days. Best, Alex
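For anyone who wants to split these headers programmatically, here is a minimal Python sketch of the scheme described above (cluster, batch, individual id, separated by underscores). The names `IsoformId` and `parse_isoform_header` are made up for illustration and are not part of isONform; the sketch also assumes the header has exactly three integer fields and accepts either a leading `>` or the older `@`.

```python
from typing import NamedTuple


class IsoformId(NamedTuple):
    """Hypothetical container for the three header fields."""
    cluster: int   # cluster the isoform was generated from
    batch: int     # batch number within that cluster
    isoform: int   # individual id within the batch


def parse_isoform_header(header: str) -> IsoformId:
    """Split a header such as '@0_105_891' or '>0_105_891' into its fields.

    Assumes the layout <cluster>_<batch>_<id>; anything else raises ValueError.
    """
    # Drop the leading marker and keep only the first whitespace-separated token.
    name = header.strip().lstrip(">@").split(None, 1)[0]
    parts = name.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected <cluster>_<batch>_<id>, got {name!r}")
    cluster, batch, isoform = (int(p) for p in parts)
    return IsoformId(cluster, batch, isoform)


# Header taken from the example earlier in this thread.
print(parse_isoform_header("@0_105_891"))
# IsoformId(cluster=0, batch=105, isoform=891)
```

Grouping records by the `cluster` field then gives all isoforms that came from the same isONclust cluster.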

alexyfyf commented 7 months ago

Hi Alex,

Thank you for your reply. So my understanding is that your transcript identifiers are derived from the gene clusters from isONclust, so the cluster id, i.e. the first number, could be used as a gene id surrogate. Am I correct?

Thank you. Alex

aljpetri commented 7 months ago

Hi Alex, the clusters generated by isONclust represent gene families rather than individual genes, so it would be risky to use them as gene surrogates. Best, Alex