SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0

add a table of new ID and previous ID after modifying gffs #34

Closed jeanmanguy closed 4 years ago

jeanmanguy commented 4 years ago

Hi,

Not sure if it was addressed somewhere else or if I missed a command line argument

The normalisation of the GFFs changes the IDs from the input GFFs, which makes it difficult to combine data computed from the original GFFs with the output of PIRATE. When running PIRATE with only the CDS, the tRNA and mRNA are removed, thus shifting the IDs. I was very confused when manually checking PIRATE results because the names/lengths of sequences in PIRATE.gene_families didn't always match (IDs at the top of the input GFF files may not be modified because tRNA and rRNA are only found later in the file).

Can I suggest either always using the original IDs in the output files or providing a table with the new and old IDs, so we can correctly merge datasets?

jeanmanguy commented 4 years ago

Sorry, it seems to be covered by subsample_outputs.pl, but why not use the previous locus tags by default?

Also, for subsample_outputs.pl the README needs to be updated: --field "prev_locustag" doesn't work but --field "prev_locus" does

SionBayliss commented 4 years ago

Hi Jean,

Thanks for that, I have updated the README in master to fix this error and (hopefully) provide more clarity.

I actually struggled with this question during development and I am not satisfied with it as it stands. I tested PIRATE on a number of collaborators' datasets during development that had been sourced from various annotation software and databases. I was surprised by the number of non-standard characters, tags and fields included in the GFFs (such as forward slashes in gene names!). As a workaround I created temporary GFFs with fixed IDs. PIRATE has a number of post-processing tools that rely on the output tables (PIRATE.gene_families.tsv etc.) and they reference internal files produced throughout the pipeline. It became incrementally more complex to simply return to the original nomenclature at the end of the pipeline without a major rewrite.

In essence it is a trade-off between telling users to 'fix their datasets', which may lead to accessibility issues, or providing a post-processing script for users who need the additional info.

I am sure this isn't the final iteration of the pipeline, so suggestions for the To Do list are always appreciated.

I hope that helps, Sion

jeanmanguy commented 4 years ago

Hi Sion,

I totally understand your concern about weird names, been there, done that. In the past I chose to do it the other way, but I understand your choice too. I guess my dataset was "too clean", so I didn't see at first glance that there was a difference between the old and new IDs, as only some numbers changed. I am a bit mad at myself: I wasted a lot of time because I didn't read the doc well enough.

Thanks

SionBayliss commented 4 years ago

I am sorry that that happened. Do you think that it would be helpful to have a warning in the STDOUT/log?

jeanmanguy commented 4 years ago

I don't know. I don't always look at the log, especially when it takes some time to run, and I guess other people don't either.

What I did before finding the subsample_outputs script was to parse the modified_gffs/* with awk to create a table with both the old and new IDs (that's how I found the differences). After that I turned my PIRATE.gene_families.tsv file into long format with R, then simply did a left join with the newly created table and got my original IDs back. That's why I was suggesting automatically creating this correspondence table. I would have noticed an additional file.
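For readers who want to reproduce the workaround described above, here is a minimal Python sketch of the same idea (the original used awk and R). It rests on assumptions that are not confirmed in this thread: that each feature line in the modified GFFs carries the new ID in an `ID` attribute and the original one in a `prev_locus` attribute (the field name that worked with subsample_outputs.pl), and that the gene families table has already been reshaped to one row per locus. The function names, the `locus_id` column and all example IDs are hypothetical.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def build_id_table(gff_lines, old_key="prev_locus"):
    """Return {new_id: old_id} for feature lines that carry both attributes.

    Assumption: the old ID is stored under `old_key` in the modified GFF.
    """
    mapping = {}
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # the sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete feature line
        attrs = parse_attributes(cols[8])
        if "ID" in attrs and old_key in attrs:
            mapping[attrs["ID"]] = attrs[old_key]
    return mapping


def left_join(rows, id_table, id_column="locus_id"):
    """Add 'original_id' to every row; None where the new ID has no match."""
    return [dict(row, original_id=id_table.get(row[id_column])) for row in rows]


# Made-up example data, for illustration only.
modified_gff = [
    "##gff-version 3\n",
    "contig1\tPIRATE\tCDS\t1\t300\t.\t+\t0\tID=g1_00001;prev_locus=ABC_00001\n",
    "contig1\tPIRATE\tCDS\t900\t1500\t.\t-\t0\tID=g1_00002;prev_locus=ABC_00003\n",
]
long_rows = [
    {"gene_family": "g0001", "locus_id": "g1_00001"},
    {"gene_family": "g0002", "locus_id": "g1_00002"},
    {"gene_family": "g0003", "locus_id": "g1_00099"},  # no match in the GFF
]

id_table = build_id_table(modified_gff)
for row in left_join(long_rows, id_table):
    print(row["gene_family"], row["locus_id"], row["original_id"])
```

A left join keeps every row of the gene families table, so loci whose new ID is missing from the correspondence table show up with an empty original ID instead of silently disappearing.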

SionBayliss commented 4 years ago

I will look into something for the next release. If you have any other ideas or issues, let me know!