VIB-PSB / MINI-AC

Motif-Informed Network Inference based on Accessible Chromatin (MINI-AC) is a method that combines accessible chromatin data from bulk or single-cell experiments with transcription factor binding site enrichment to learn gene regulatory networks in plants.

Update of Arabidopsis gene-GO file #16

Open nicomaper opened 11 months ago

nicomaper commented 11 months ago

The current Arabidopsis gene-GO file is missing gene-GO pairs that come from ‘high throughput’ evidence code (HTP). It should be updated using the gene-GO file from PLAZA 5.0

hdbeukel commented 11 months ago

There are actually even more evidence types missing (see GO website):

- Inferred from High Throughput Experiment (HTP)
- Inferred from High Throughput Direct Assay (HDA)
- Inferred from High Throughput Mutant Phenotype (HMP)
- Inferred from High Throughput Genetic Interaction (HGI)
- Inferred from High Throughput Expression Pattern (HEP)

hdbeukel commented 11 months ago

This is the reprocessed TAIR10 annotation file (BP, curated and experimental annotations only, extended to parental terms), now including high-throughput experimental annotations: ath_BP_cur_exp_extended_tair10.txt (392.957 annotations)

The respective annotations as processed from PLAZA: ath_BP_cur_exp_extended_plaza.txt (394.411 annotations)

As you can see, they differ a bit. As expected, PLAZA contains annotations that were missing in TAIR10, but the reverse is also true. Ignoring the specific evidence types, there are 358.530 (~90%) annotations in common between PLAZA and TAIR10. The number of annotations present in one set but not in the other is summarised in the table below.

# Specific annotations    ATXXX ids    non-ATXXX ids
PLAZA                     35.881       0
TAIR10                    25.443       8.984
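
For reference, a comparison of this kind could be sketched as below. This is a minimal illustration, not the script that produced the counts: the function name `compare_annotations`, the regular expression for "ATXXX ids" (assumed here to mean AGI locus codes such as AT1G01010) and the toy gene-GO pairs are all assumptions.

```python
import re

# Assumption: "ATXXX ids" are AGI locus codes (AT1G..AT5G, ATCG, ATMG + 5 digits).
AGI_PATTERN = re.compile(r"^AT[1-5CM]G\d{5}$", re.IGNORECASE)


def compare_annotations(plaza, tair):
    """Compare two sets of (gene_id, go_term) pairs, ignoring evidence codes."""

    def split_by_id_type(pairs):
        # Separate pairs whose gene id looks like an AGI code from the rest.
        agi = {pair for pair in pairs if AGI_PATTERN.match(pair[0])}
        return len(agi), len(pairs - agi)

    return {
        "common": len(plaza & tair),
        "plaza_only": split_by_id_type(plaza - tair),  # (ATXXX, non-ATXXX)
        "tair_only": split_by_id_type(tair - plaza),   # (ATXXX, non-ATXXX)
    }


# Toy data for illustration only, not real annotations.
plaza = {("AT1G01010", "GO:0006355"), ("AT1G01020", "GO:0006629")}
tair = {
    ("AT1G01010", "GO:0006355"),
    ("AT2G01008", "GO:0009733"),
    ("MIR156A", "GO:0035195"),
}
summary = compare_annotations(plaza, tair)
print(summary)
```

Working on sets of (gene, GO term) pairs makes "ignoring the specific evidence types" explicit: the evidence code is simply not part of the key.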

We argued that leaving out the ~9k non-ATXXX ids unique to TAIR10 is desirable, but what about the >25k ATXXX gene annotations that are unique to TAIR10? Should we include these as well, in addition to the PLAZA annotations?

nicomaper commented 11 months ago

Alright, but maybe first we should find out why they are not in PLAZA, because maybe there is a reason for that. Perhaps it is just that the TAIR annotation has been updated after the PLAZA release, in which case I would be in favor of adding them, but maybe there was another reason (quality, etc.). Knowing that would be important to make a decision on whether to include them or not.

hdbeukel commented 11 months ago

Ok, so we decided to include all PLAZA annotations plus the ATxGxxx gene annotations from TAIR10 that were not in PLAZA. As the PLAZA v5 data was generated about three years ago, the missing annotations are likely new annotations.
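
The merge rule could look roughly like this. It is a sketch under assumptions: annotations are held as dicts mapping (gene, GO term) to an evidence class, the name `merge_annotations` is hypothetical, and ATxGxxx ids are matched with an assumed AGI regular expression.

```python
import re

# Assumption: ATxGxxx ids follow the AGI locus code format.
AGI_PATTERN = re.compile(r"^AT[1-5CM]G\d{5}$", re.IGNORECASE)


def merge_annotations(plaza, tair):
    """Keep every PLAZA annotation; add TAIR10 pairs that are absent from
    PLAZA and whose gene id matches the AGI pattern.

    Both inputs map (gene_id, go_term) -> evidence class.
    """
    merged = dict(plaza)
    for (gene, go_term), evidence in tair.items():
        if (gene, go_term) not in merged and AGI_PATTERN.match(gene):
            merged[(gene, go_term)] = evidence
    return merged


# Toy data for illustration only, not real annotations.
plaza = {("AT1G01010", "GO:0006355"): "exp"}
tair = {
    ("AT1G01010", "GO:0006355"): "cur",  # already in PLAZA, PLAZA entry wins
    ("AT2G01008", "GO:0009733"): "exp",  # TAIR10-only ATxGxxx id, added
    ("MIR156A", "GO:0035195"): "exp",    # non-ATxGxxx id, left out
}
merged = merge_annotations(plaza, tair)
print(sorted(merged))
```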

This would be the new annotation file for Arabidopsis: ath_go_gene_file.txt. @nicomaper can you check it before I make a pull request?

Data has been extended to parental terms and filtered for:

In case of duplicate annotations (same gene, same GO term) only the one with the highest priority (most relevant) evidence code has been retained (exp > cur).
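
A minimal sketch of this duplicate rule, assuming each record is a (gene, GO term, evidence class) tuple with evidence class either "exp" or "cur"; the names `PRIORITY` and `deduplicate` are made up for the example.

```python
# Lower number = higher priority, i.e. exp > cur as described above.
PRIORITY = {"exp": 0, "cur": 1}


def deduplicate(records):
    """For each (gene, GO term) keep only the highest-priority evidence class."""
    best = {}
    for gene, go_term, evidence in records:
        key = (gene, go_term)
        if key not in best or PRIORITY[evidence] < PRIORITY[best[key]]:
            best[key] = evidence
    return [(gene, go_term, ev) for (gene, go_term), ev in best.items()]


# Toy data for illustration only, not real annotations.
records = [
    ("AT1G01010", "GO:0006355", "cur"),
    ("AT1G01010", "GO:0006355", "exp"),
    ("AT1G01020", "GO:0006629", "cur"),
]
deduplicated = deduplicate(records)
print(deduplicated)
```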

hdbeukel commented 11 months ago

As discussed I will reprocess the file to remove GO terms with over 1.000 annotated genes, to avoid testing for enrichment of very general terms.

hdbeukel commented 11 months ago

@nicomaper after filtering the file to retain only GO terms with fewer than 1.000 annotated genes: ath_go_gene_file.txt.
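
A size filter like this could be sketched as follows, assuming the annotations are (gene, GO term) pairs; `filter_large_terms` and its `max_genes` parameter are hypothetical names, and the toy call uses a threshold of 2 instead of 1.000 to keep the example small.

```python
from collections import defaultdict


def filter_large_terms(pairs, max_genes=1000):
    """Drop every GO term annotated to max_genes or more distinct genes."""
    genes_per_term = defaultdict(set)
    for gene, go_term in pairs:
        genes_per_term[go_term].add(gene)
    return [
        (gene, go_term)
        for gene, go_term in pairs
        if len(genes_per_term[go_term]) < max_genes
    ]


# Toy data for illustration only, not real annotations.
pairs = [
    ("AT1G01010", "GO:0006355"),
    ("AT1G01020", "GO:0006355"),
    ("AT1G01020", "GO:0006629"),
]
kept = filter_large_terms(pairs, max_genes=2)
print(kept)
```

Counting distinct genes per term (rather than rows) keeps the filter correct even if the input still contains duplicate gene-GO pairs.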

hdbeukel commented 11 months ago

Obsolete ids have now also been removed. If the GO tree provides a replaced_by id, the obsolete id has been replaced with that id; otherwise the annotation has been discarded.
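
In code, the handling of obsolete ids could look roughly like this. It is only a sketch: it assumes the set of obsolete ids and the replaced_by mapping have already been extracted from the GO tree, the name `resolve_obsolete` is hypothetical, and the GO ids in the toy data are deliberately fake placeholders.

```python
def resolve_obsolete(pairs, obsolete, replaced_by):
    """Replace obsolete GO ids with their replaced_by id, or drop the pair.

    pairs: iterable of (gene_id, go_term)
    obsolete: set of obsolete GO ids
    replaced_by: dict mapping an obsolete GO id to its replacement id
    """
    resolved = set()
    for gene, go_term in pairs:
        if go_term in obsolete:
            go_term = replaced_by.get(go_term)
            if go_term is None:
                continue  # no replacement available: discard the annotation
        resolved.add((gene, go_term))
    return resolved


# Fake placeholder ids for illustration only.
pairs = [
    ("AT1G01010", "GO:9999991"),  # obsolete, has a replacement
    ("AT1G01020", "GO:9999992"),  # obsolete, no replacement
    ("AT1G01030", "GO:9999993"),  # not obsolete
]
obsolete = {"GO:9999991", "GO:9999992"}
replaced_by = {"GO:9999991": "GO:9999994"}
resolved = resolve_obsolete(pairs, obsolete, replaced_by)
print(sorted(resolved))
```

Collecting the result in a set means that a replacement which collides with an existing annotation of the same gene does not create a duplicate. The sketch does not follow chains where the replacement id is itself obsolete.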

Final go-gene file: ath_go_gene_file.txt. Includes PLAZA 5 annotations + TAIR10 ATXGXXX annotations not found in PLAZA.

Final applied filtering:

hdbeukel commented 11 months ago

After further discussion we decided to keep all GO terms (except the BP root) in the annotation file. Updated file: ath_go_gene_file.txt.

Other properties have not changed (see above).

We will further investigate how to exclude generic terms from enrichment testing when performing the actual analysis; new options will be added to enricher for this.