LegumeFederation / legfed_gene_families

A repository for managing tasks relating to the production of gene families for use by the Legume Federation

need polypeptides fastas for outgroup datasets #9

Closed adf-ncgr closed 7 years ago

adf-ncgr commented 7 years ago

They will need to be represented in chado before I can load the trees/MSAs. I might be able to extract them from the individual member files in 32_family_fasta, but it would be more convenient (and possibly less error-prone) if you could point me to your initial input files (I don't see anything in the private datastore area that looks promising, but I may have overlooked something). Some of them may be unchanged from the previous loading (e.g. arath, vitvi), but some are species not previously included (e.g. cucsa), and some I think have newer versions than previously included (e.g. prupe).

StevenCannon-USDA commented 7 years ago

The proteomes are now at genefams_ks_mcl_2017/01_proteomes/. I'm now assessing which ones are missing the splice variant in the names. It looks like glyma, phavu, tripr, aradu, araip (though we know that aradu and araip are all .1).

StevenCannon-USDA commented 7 years ago

Assessment of the stripped splice variants:

For glyma and phavu, the Phytozome _primaryTranscriptOnly variants are all ".1".

For tripr, I used tripr.MilvusB.gnm2.ann1.DFgp.protein_primaryTranscript.faa without modification.

For aradu and araip, the splice variants are all ".1".

Andrew: should I go through all of the intermediate and final files (alignments, trees, HMMs, etc.) and add .1 to glyma, phavu, aradu, araip? Note that glyma also has a ".p" suffix, which I am reluctant to add back. How about tripr: do we need name changes?
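One way to double-check the "all .1" assessment above is to tally the trailing splice-variant suffixes in a proteome. The following is only a sketch: the function name `splice_suffix_counts` is hypothetical, and it assumes the ID is the first whitespace-delimited token of each FASTA header and that a splice variant appears as a trailing `.<digits>`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the thread's tooling): read FASTA on stdin
# and print a tab-separated tally of trailing ".<digits>" suffixes on the IDs.
# Assumption: the ID is the first whitespace-delimited token of the header.
splice_suffix_counts() {
  awk '
    /^>/ {
      id = $1
      # match() sets RSTART to where the trailing ".<digits>" begins, if any.
      if (match(id, /\.[0-9]+$/)) {
        suffix = substr(id, RSTART)
      } else {
        suffix = "(none)"
      }
      counts[suffix]++
    }
    END { for (s in counts) print s "\t" counts[s] }
  ' | LC_ALL=C sort
}
```

For example, `gzip -dc some_proteome.faa.gz | splice_suffix_counts` (file name is a placeholder) would show at a glance whether anything other than ".1" occurs.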

adf-ncgr commented 7 years ago

@cann0010 my renaming script can handle the intermediate/final files (doing the trivial "add .1" for glyma, phavu, aradu, araip and using a lookup to deal with the tripr case).
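For illustration only, the two renaming steps described above could look roughly like this. This is not the actual renaming script: the function names, the `<species>.<gene>` ID layout, and the two-column map file for tripr are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the renaming script mentioned in this thread.
# Assumption: header IDs look like "<species>.<gene id>", first token only.

# Append ".1" to glyma/phavu/aradu/araip IDs that lack a trailing ".<digits>".
# Caveat: an ID whose gene name itself ends in ".<digits>" would be skipped.
add_splice_suffix() {
  awk '
    /^>/ {
      id = $1
      rest = substr($0, length($1) + 1)   # description after the ID, if any
      if (id ~ /^>(glyma|phavu|aradu|araip)\./ && id !~ /\.[0-9]+$/) {
        id = id ".1"
      }
      print id rest
      next
    }
    { print }
  '
}

# Rename IDs via a lookup file (assumed layout: old ID, whitespace, new ID).
# $1 is the map file; FASTA is read from stdin. Unmapped IDs pass through.
rename_by_lookup() {
  awk '
    FILENAME == ARGV[1] { map[$1] = $2; next }
    /^>/ {
      id = substr($1, 2)
      rest = substr($0, length($1) + 1)
      if (id in map) id = map[id]
      print ">" id rest
      next
    }
    { print }
  ' "$1" -
}
```

The same idea would have to be extended to alignments, trees, and HMMs, whose formats differ from FASTA; only the header case is sketched here.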

Do you know how the tripr.MilvusB.gnm2.ann1.DFgp.protein_primaryTranscript.faa file got created? My feeling is that we want to fix this and ensure that our methods for constructing these files do not re-introduce the issue when new species come in.

Regarding the glyma ".p" suffix, I have mixed feelings about it; the same is true for vigun, i.e. the Phytozome naming has .p. I'm not sure why this isn't the case for their phavu too, but I guess they are as consistently inconsistent as we are ;) I kind of don't like changing the naming that the source provider uses just because we don't like it (NB: I except from this our practice of adding standard namespace prefixes). On the other hand, in the procedure for actually including proteins in chado, because these aren't directly represented as features in the gff, they end up initially getting auto-named and then we rename them after the fact, and it is easy and convenient to name them identically with the mRNA from which they were derived; not that that makes it right. I seem to remember that Zea mays had a slightly different convention (e.g. _T1 vs _P1), so I think it's not unique to Phytozome to think that maybe protein names ought to be distinguished.

So, I think I could probably go either way on this, but feel like it ought to be a "federation decision" as to best practices in this regard (a bit bureaucratic, perhaps, but it seems that in theory it is a good thing to try not to make arbitrary decisions that will potentially impact our members; I will probably violate this highly principled dictum at least 3 times before it comes to a vote...)

StevenCannon-USDA commented 7 years ago

In case it's useful, I have added the .1 suffix to the four directories in genefams_ks_mcl_2017_patched/. I haven't yet replaced the corresponding directories in genefams_ks_mcl_2017/; will wait for OK from you.

Regarding the confusion with tripr: I don't know. The mix-up between the _mRNA names and the _gene names is in the data store. There was so much back-and-forthing during the preparation of this data set that I don't think a post-mortem is likely to be useful. However, I note that for some analyses I do (e.g. Ks analysis), it is problematic to have names with different formats for the same gene, e.g. _mRNA2568 and _protein2568. So I may well have "picked one" for both the protein_primaryTranscript and the cds_primaryTranscript.

for file in *.f??.gz; do echo $file; zcat $file | head -1 | cut -f1 -d' '; echo ; done

Having the _gene format:
ahrd OK
cds_primaryTranscript XXX
protein_primaryTranscript OK

Having the _mRNA format:
cds OK
protein XXX
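The check above could also be wrapped so that each file is labeled by the ID format of its first record, which makes mismatches easier to spot. This is a sketch under assumptions: the function name `first_id_format` is hypothetical, and the patterns only look for `_gene`, `_mRNA`, or `_protein` followed by a digit, based on the examples mentioned in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which ID style the first FASTA header of a
# gzipped file uses. Patterns are guesses based on names such as _mRNA2568
# and _protein2568 quoted in this thread.
first_id_format() {
  local id
  # Same pipeline as the one-liner: first header line, first space-delimited token.
  id=$(gzip -dc -- "$1" | head -n 1 | cut -f1 -d' ')
  case "$id" in
    *_gene[0-9]*)    echo "gene" ;;
    *_mRNA[0-9]*)    echo "mRNA" ;;
    *_protein[0-9]*) echo "protein" ;;
    *)               echo "other" ;;
  esac
}

# Example loop (same glob as above), printing "<file><TAB><format>":
# for file in *.f??.gz; do printf '%s\t%s\n' "$file" "$(first_id_format "$file")"; done
```

Only the first record is inspected, so a file that mixes formats further down would not be caught.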

adf-ncgr commented 7 years ago

it's fine to replace genefams_ks_mcl_2017, but thanks for letting me know in advance! regarding the Ks analysis, that is a point worth considering as we elaborate our standards for data store representation. I agree that a postmortem on tripr isn't worth your time, I'm just envisioning a future that doesn't involve https://en.wikipedia.org/wiki/Eternal_return :)

StevenCannon-USDA commented 7 years ago

OK: I have replaced 01_proteomes/, 32_family_fasta/, 33_hmmalign_trim2/, 37_trees_combined/. I think these are the only directories affected. My notes are in genefams_ks_mcl_2017/notes/ (Eternal_return : I fully expect that our counterparts will be discussing inconsistent identifiers several billion years hence - or however time works in this cosmology. In fact, I suspect I didn't like it when we had to deal with this during the last universe either :-)