Open edeutsch opened 2 years ago
okay, after discussion with collaborators, please use this list for ATMGs to keep: ATMG01190 ATMG00640 ATMG00410 ATMG01170 ATMG00480 ATMG01080 ATMG00960 ATMG00110 ATMG00900 ATMG00830 ATMG00180 ATMG00220 ATMG01360 ATMG00160 ATMG00730 ATMG00520 ATMG00516 ATMG00285 ATMG00990 ATMG00580 ATMG00650 ATMG00060 ATMG00270 ATMG00510 ATMG00070 ATMG01000 ATMG00690 ATMG00075 ATMG01430 ATMG01420 ATMG00570 ATMG00560 ATMG00080 ATMG00210 ATMG00980 ATMG00090 ATMG00290 ATMG01270
After discussion with collaborators, please use this for ATCGs to keep: ATCG00500.1 ATCG00120.1 ATCG00480.1 ATCG00470.1 ATCG00130.1 ATCG00140.1 ATCG00150.1 ATCG00670.1 ATCG00040.1 ATCG01090.1 ATCG00420.1 ATCG00430.1 ATCG01100.1 ATCG00890.1 ATCG00440.1 ATCG01050.1 ATCG01070.1 ATCG01010.1 ATCG01080.1 ATCG01110.1 ATCG00540.1 ATCG00720.1 ATCG00730.1 ATCG00600.1 ATCG00590.1 ATCG00210.1 ATCG00350.1 ATCG00340.1 ATCG01060.1 ATCG00510.1 ATCG00630.1 ATCG00020.1 ATCG00680.1 ATCG00280.1 ATCG00270.1 ATCG00580.1 ATCG00570.1 ATCG00710.1 ATCG00080.1 ATCG00550.1 ATCG00070.1 ATCG00560.1 ATCG00220.1 ATCG00700.1 ATCG00690.1 ATCG00490.1 ATCG00780.1 ATCG00790.1 ATCG00660.1 ATCG00830.1 ATCG00810.1 ATCG00840.1 ATCG01020.1 ATCG00640.1 ATCG00760.1 ATCG00740.1 ATCG00190.1 ATCG00180.1 ATCG00170.1 ATCG00750.1 ATCG00065.1 ATCG00330.1 ATCG01120.1 ATCG00050.1 ATCG00650.1 ATCG00820.1 ATCG00160.1 ATCG00800.1 ATCG00380.1 ATCG00900.1 ATCG00770.1 ATCG00530.1 ATCG01130.1 ATCG00870.1 ATCG00860.1 ATCG00360.1 ATCG00520.1 ATCG01040.1 ATCG00300.1
These two new lists supersede the previous lists you were using.
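For illustration, here is a minimal Python sketch of how the two keep lists could be applied. The names `gene_id`, `keep_entry`, `KEEP_ATMG` and `KEEP_ATCG` are hypothetical, and the lists are truncated to the first few IDs (paste in the full lists above). One assumption to check: the ATMG list has no `.1` suffix while the ATCG list does, so this sketch compares at the gene level and ignores the isoform suffix.

```python
def gene_id(accession: str) -> str:
    """Strip an isoform suffix, e.g. 'ATCG00500.1' -> 'ATCG00500'."""
    return accession.split(".", 1)[0]


# Truncated for brevity: replace with the full lists given above.
KEEP_ATMG = {gene_id(x) for x in "ATMG01190 ATMG00640 ATMG00410".split()}
KEEP_ATCG = {gene_id(x) for x in "ATCG00500.1 ATCG00120.1 ATCG00480.1".split()}


def keep_entry(accession: str) -> bool:
    """Return True if a proteome entry survives the ATMG/ATCG filtering.

    Assumption: matching is done on the gene ID, so any isoform of a
    listed gene is kept. Entries that are neither ATMG nor ATCG are
    never removed by this step.
    """
    gene = gene_id(accession)
    if gene.startswith("ATMG"):
        return gene in KEEP_ATMG
    if gene.startswith("ATCG"):
        return gene in KEEP_ATCG
    return True
```

If the intent is instead to keep only the exact `.1` accessions for ATCG, the comparison would need to use the full accession rather than the gene ID.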
The next goal will be to prepare a refined Arabidopsis proteome that we can publish. The current state is very messy. Araport11 is the latest reference, but it has lots of extra entries and is also missing entries. We will aim to fix that as best we can.
So we need a program that can do the following, and probably be extended a bit more with some additional operations.
Read Araport11 to start a "master reference proteome" for Arabidopsis
Programmatically start applying the following set of operations to improve it and finish by writing out a new proteome
Remove all entries with suffix .2 or higher that are identical in sequence to their xxxxxx.1 version
Using the Excel file (Arabidopsis_plastid_and Mito-encoded annotated proteins4.xls) in the Google Drive folder https://drive.google.com/drive/u/0/folders/1WQ2RwsGl0I8KNCv2azdaQUMj-cdOJ8ad, delete any ATMG entries that are not among the column A entries of the mitochondria tab that have no value in column B (i.e. discard the genes not in the white rows), and delete any ATCG entries not found in column B of the plastid tab
Read TAIR10
Find all cases where TAIR10_xxxxxxx.1 has a xxxxxxxx.n match in Araport11 and insert the "Symbols: ... |" string in the description from TAIR10 into the new reference (started from Araport11)
Add a blank "Symbols: |" into all other entries with no match from TAIR10
Write it out
There will likely be some additional improvements
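As a starting point, the isoform and "Symbols:" steps above could be sketched in Python roughly as follows. This is only a sketch under assumptions: all function names are hypothetical, headers are assumed to look like `>AT1G01010.1 | description | ...` in Araport11 and `>AT1G01010.1 | Symbols: ... | description | ...` in TAIR10, and the TAIR10 symbols from the `.1` entry are applied to every `.n` isoform of the same gene.

```python
import re


def read_fasta(lines):
    """Parse FASTA lines into a list of (header, sequence) tuples."""
    entries, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


def split_accession(header):
    """'AT1G01010.2 | ...' -> ('AT1G01010', 2); no suffix counts as 1."""
    accession = header.split()[0]
    gene, _, suffix = accession.partition(".")
    return gene, int(suffix) if suffix.isdigit() else 1


def drop_redundant_isoforms(entries):
    """Drop .2-or-higher entries whose sequence equals the .1 sequence."""
    first = {}
    for header, seq in entries:
        gene, n = split_accession(header)
        if n == 1:
            first[gene] = seq
    kept = []
    for header, seq in entries:
        gene, n = split_accession(header)
        if n >= 2 and first.get(gene) == seq:
            continue
        kept.append((header, seq))
    return kept


# Assumed TAIR10 layout: the Symbols field sits between two '|' separators.
SYMBOLS_RE = re.compile(r"\|\s*(Symbols:[^|]*)\|")


def tair10_symbols(tair_entries):
    """Map gene ID -> 'Symbols: ...' string taken from the TAIR10 .1 entry."""
    symbols = {}
    for header, _ in tair_entries:
        gene, n = split_accession(header)
        match = SYMBOLS_RE.search(header)
        if n == 1 and match:
            symbols[gene] = match.group(1).strip()
    return symbols


def add_symbols(entries, symbols):
    """Insert the TAIR10 Symbols field, or a blank 'Symbols:' field, after the accession."""
    out = []
    for header, seq in entries:
        accession, _, rest = header.partition("|")
        gene, _ = split_accession(header)
        field = symbols.get(gene, "Symbols:")
        new_header = f"{accession.strip()} | {field} | {rest.strip()}".rstrip()
        out.append((new_header, seq))
    return out


def write_fasta(entries, handle, width=60):
    """Write entries as FASTA with sequence lines wrapped at `width`."""
    for header, seq in entries:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

The intended order would be `read_fasta` on Araport11, then `drop_redundant_isoforms`, then the ATMG/ATCG filtering, then `add_symbols` using `tair10_symbols` built from TAIR10, and finally `write_fasta`. Keeping each operation as its own function should make it easier to add the further improvements later.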