NaegleLab / CoDIAC


Naming convention of tandem domains #37

Closed alekhyaa2 closed 1 year ago

alekhyaa2 commented 1 year ago

Is your feature request related to a problem? Please describe.

Tandem domains in the UniProt reference file, which are obtained through the InterPro fetch, all have the same name. This causes issues in downstream analysis. By contrast, 'domains' in the PDB reference file uses different names for tandem domains.

Describe the solution you'd like

I suggest we process the domain names into a specific format that we use consistently throughout the pipeline and for referencing. For example, PTPN11 is currently SH2|SH2|PTP_cat in the UniProt reference file and SH2|SH2_2|PTP_cat in the PDB reference file; we could change it to SH2_1|SH2_2|PTP_cat.
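A minimal sketch of the proposed renaming, for illustration only. It assumes the domain architecture is a single pipe-delimited string as in the PTPN11 example; `number_tandem_domains` is a hypothetical helper, not a CoDIAC function, and it numbers every name that occurs more than once whether or not the copies are adjacent.

```python
from collections import Counter


def number_tandem_domains(architecture: str, sep: str = "|") -> str:
    """Append _1, _2, ... to every domain name that occurs more than once.

    Hypothetical helper: names that occur only once are left unchanged.
    """
    names = architecture.split(sep)
    totals = Counter(names)   # how many times each name occurs overall
    seen = Counter()          # how many copies we have emitted so far
    numbered = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            numbered.append(f"{name}_{seen[name]}")
        else:
            numbered.append(name)
    return sep.join(numbered)


print(number_tandem_domains("SH2|SH2|PTP_cat"))  # SH2_1|SH2_2|PTP_cat
```

Applying the same function to both reference files would give them one shared naming scheme, provided both list domains in the same order.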

knaegle commented 1 year ago

It is unclear whether this is a better solution, partly because the ticket does not say what type of downstream processing is a problem.

Right now, the FASTA headers are unique, since they include the domain number and the start and stop positions. I find having different domain names confusing for automatic processing, and I think identical names are a benefit of InterPro compared to UniProt naming conventions. There is no consistent way to include a designation of the domain number that would be parseable (e.g. SH3_2 means a type 2 SH3 domain, and there are other domain names that use '-' as well as '_').
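The parsing concern can be sketched as follows. This is an illustration under assumptions, not CoDIAC code: `naive_split` is a hypothetical parser that treats any trailing `_<digits>` as a copy number, and `SH3_2` is taken as an example of a name that denotes a domain type rather than a second copy.

```python
import re
from typing import Optional, Tuple

# Hypothetical rule: a trailing "_<digits>" is read as a copy index.
SUFFIX = re.compile(r"^(?P<base>.+)_(?P<index>\d+)$")


def naive_split(name: str) -> Tuple[str, Optional[int]]:
    """Split a domain name into (base name, copy index or None)."""
    match = SUFFIX.match(name)
    if match is None:
        return name, None
    return match.group("base"), int(match.group("index"))


# A numbered tandem copy and a type name are indistinguishable to the parser:
print(naive_split("SH2_2"))    # ('SH2', 2)  intended: second SH2 copy
print(naive_split("SH3_2"))    # ('SH3', 2)  misread: this is a domain type
print(naive_split("PTP_cat"))  # ('PTP_cat', None)
```

Keeping the copy number in a separate field, as the FASTA headers do, avoids having to recover it from the name.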

Perhaps you can add specifics on what the issue in downstream processing is.

knaegle commented 1 year ago

Closing this ticket after a discussion. We determined that the current behavior, numbering domains in the FASTA headers, is the preferred solution. However, we could reconsider adding a new field to the UniProt CSV reference for domains that includes numbering.