Rfam / rfam-production

Rfam production pipeline
Apache License 2.0

Create a taxonomy export file #156

Open blakesweeney opened 1 year ago

blakesweeney commented 1 year ago

We should create exports of the taxonomic assignments for all Rfam families. This is basically what is done in the rfam-taxonomy repo, but the exports should be part of our FTP. I'm thinking we should use a JSON file, which has an entry for each family like:

{
    "accession": "RF00001",
    "rfam_id": "5S_rRNA",
    "description": "5S ribosomal RNA",
    "rfam_rna_type": "Gene; rRNA;",
    "domain": "Mixed",
    "seed": [
        { "name": "Bacteria", "fraction": 48.6 },
        { "name": "Eukaryota", "fraction": 45.51 },
        { "name": "Archaea", "fraction": 5.9 }
    ],
    "full": [
        { "name": "Eukaryota", "fraction": 87.59 },
        { "name": "Bacteria", "fraction": 12.0 },
        { "name": "Archaea", "fraction": 0.4 }
    ]
}

I'm not sure if it should be one object per line (jsonl) or if there should be a single object with all families. I'm open to suggestions.
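To make the two layouts concrete, here is a minimal Python sketch that writes the same records either as one object per line (JSONL) or as a single JSON object. The function names and the choice to key the single object by accession are assumptions for illustration, not part of the proposal; the record reuses the RF00001 example values from above.

```python
import io
import json

# Example record in the proposed shape (values copied from the RF00001 example).
families = [
    {
        "accession": "RF00001",
        "rfam_id": "5S_rRNA",
        "description": "5S ribosomal RNA",
        "rfam_rna_type": "Gene; rRNA;",
        "domain": "Mixed",
        "seed": [
            {"name": "Bacteria", "fraction": 48.6},
            {"name": "Eukaryota", "fraction": 45.51},
            {"name": "Archaea", "fraction": 5.9},
        ],
        "full": [
            {"name": "Eukaryota", "fraction": 87.59},
            {"name": "Bacteria", "fraction": 12.0},
            {"name": "Archaea", "fraction": 0.4},
        ],
    },
]


def write_jsonl(records, handle):
    """Write one compact JSON object per line."""
    for record in records:
        handle.write(json.dumps(record) + "\n")


def read_jsonl(handle):
    """Yield records one at a time, so the file never has to fit in memory."""
    for line in handle:
        if line.strip():
            yield json.loads(line)


def write_single_json(records, handle):
    """Write one JSON object keyed by accession (hypothetical layout)."""
    json.dump({r["accession"]: r for r in records}, handle, indent=2)


# Round-trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_jsonl(families, buffer)
buffer.seek(0)
roundtrip = list(read_jsonl(buffer))
```

JSONL can be streamed and filtered line by line with standard text tools, while the single object allows direct lookup by accession once it has been loaded in full.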

ppgardne commented 1 year ago

You might get more reuse if you use a TSV:...

accession  rfam_id  seedBacteria  seedEukaryota  seedArchaea  fullEukaryota  fullBacteria  fullArchaea
RF00001    5S_rRNA  48.6          45.51          5.9          87.59          12.0          0.4

blakesweeney commented 1 year ago

If you think more people would prefer a TSV, we can make one. We have talked about having more precise taxonomic assignments, and I don't think that would fit as cleanly into a TSV.
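As a rough illustration of that trade-off, the sketch below flattens one JSON entry into a TSV row with column names in the style suggested above. The helper names, the column order, and the choice to fill a missing taxon with 0.0 are all assumptions. Note that the set of columns is fixed by the `names` list, so any finer-grained taxonomy would add two columns per extra taxon.

```python
import csv
import io

LEVELS = ("seed", "full")
DOMAINS = ["Bacteria", "Eukaryota", "Archaea"]  # fixed column set (assumption)

entry = {
    "accession": "RF00001",
    "rfam_id": "5S_rRNA",
    "seed": [
        {"name": "Bacteria", "fraction": 48.6},
        {"name": "Eukaryota", "fraction": 45.51},
        {"name": "Archaea", "fraction": 5.9},
    ],
    "full": [
        {"name": "Eukaryota", "fraction": 87.59},
        {"name": "Bacteria", "fraction": 12.0},
        {"name": "Archaea", "fraction": 0.4},
    ],
}


def flatten(entry, names=DOMAINS):
    """Turn one nested entry into a flat dict with one column per (level, taxon)."""
    row = {"accession": entry["accession"], "rfam_id": entry["rfam_id"]}
    for level in LEVELS:
        fractions = {item["name"]: item["fraction"] for item in entry[level]}
        for name in names:
            # Taxa absent from the entry are written as 0.0 (assumption).
            row[f"{level}{name}"] = fractions.get(name, 0.0)
    return row


def to_tsv(entries, names=DOMAINS):
    """Serialize entries as tab-separated text with a header row."""
    fieldnames = ["accession", "rfam_id"] + [
        f"{level}{name}" for level in LEVELS for name in names
    ]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for e in entries:
        writer.writerow(flatten(e, names))
    return out.getvalue()


tsv_text = to_tsv([entry])
```

With only the three domains this gives eight columns. The nested JSON lists can grow to any number of taxa per family without changing the schema, which a fixed-width table cannot do.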