Open rankishore opened 6 years ago
In some cases, the information loaded into the pipeline is filtered and modified to the point that it can no longer be compared with the original data in the raw files. I would start by defining which fields must (or must not) be null.
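A minimal sketch of such a null check, assuming each loaded record is available as a plain dictionary. The field names (`gene_id`, `gene_name`, `error`) and the function `null_violations` are hypothetical and only illustrate the idea; they are not taken from the pipeline.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical field names, chosen only for illustration.
REQUIRED_FIELDS = ("gene_id", "gene_name")   # must not be null
MUST_BE_NULL_FIELDS = ("error",)             # must be null


def null_violations(
    record: Dict[str, Any],
    required: Iterable[str] = REQUIRED_FIELDS,
    must_be_null: Iterable[str] = MUST_BE_NULL_FIELDS,
) -> List[str]:
    """Return a list of human-readable null-constraint violations."""
    problems = []
    for field in required:
        # A missing key, None, and the empty string all count as null here.
        if record.get(field) in (None, ""):
            problems.append(f"{field} must not be null")
    for field in must_be_null:
        if record.get(field) not in (None, ""):
            problems.append(f"{field} must be null")
    return problems
```

Running this over every record after loading would at least flag entries that the filtering step left incomplete, even when a full comparison with the raw files is not possible.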
Add the number of 'information poor' genes for each release to the report 'number_of_concise_descriptions.txt', defined as those genes for which we add protein domain information and/or orthologous human gene molecular function and/or expression cluster information.
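One possible way to compute this number, assuming we can get, for each gene, the set of modules that contributed to its description. The module labels and the function `count_information_poor` are hypothetical placeholders, not the names used in the pipeline.

```python
from typing import Dict, Iterable

# Hypothetical labels for the modules that mark a gene as 'information poor'.
INFO_POOR_MODULES = frozenset(
    {"protein_domain", "human_ortholog_function", "expression_cluster"}
)


def count_information_poor(gene_modules: Dict[str, Iterable[str]]) -> int:
    """Count genes whose description uses at least one 'information poor' module."""
    return sum(
        1
        for modules in gene_modules.values()
        if INFO_POOR_MODULES & set(modules)
    )
```

The resulting count could then be written as one extra line per release in 'number_of_concise_descriptions.txt'.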
Create a check that compares the number of genes with a specific data type in the data source file with the number of genes with that type of data in the gene descriptions file. For example, the number of C. elegans genes with orthology to human genes in the orthology data source file for WS269 should match the number of C. elegans genes with the orthology module in the gene descriptions file for WS269.
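A rough sketch of this check, assuming the gene IDs for one data type have already been extracted from both files for the same release. The function `compare_gene_counts` and its return keys are hypothetical. Besides comparing the two counts, it also lists the IDs that differ, which is an addition meant to make a mismatch easier to debug.

```python
from typing import Any, Dict, Iterable


def compare_gene_counts(
    source_gene_ids: Iterable[str],
    description_gene_ids: Iterable[str],
) -> Dict[str, Any]:
    """Compare genes with a data type in the source file vs. the descriptions file."""
    source = set(source_gene_ids)
    described = set(description_gene_ids)
    return {
        "source_count": len(source),
        "description_count": len(described),
        "counts_match": len(source) == len(described),
        # Genes present in the source file but lacking the module in the descriptions.
        "missing_from_descriptions": sorted(source - described),
        # Genes with the module in the descriptions but absent from the source file.
        "unexpected_in_descriptions": sorted(described - source),
    }
```

For the orthology example, `source_gene_ids` would be the C. elegans genes with human orthologs in the WS269 orthology file, and `description_gene_ids` the genes with the orthology module in the WS269 gene descriptions file.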