In #59, I added additional Y matrices for use in our ML pipeline. One of these matrices contains GO term annotations per compound. However, the matrix added in that PR had over 5,000 different GO terms.
Here, I filter the GO terms to keep only those annotated for at least 20 compounds. Many GO terms (about 1,000) had annotations for only 1 compound.
This filtering step reduces the label dimension to 772 unique GO terms, which should be much better suited to training multi-label classifiers in part 2 of the ML analysis.
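For reviewers, the filtering logic can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes the Y matrix is a pandas DataFrame with one row per compound and one binary column per GO term, and the function name `filter_go_terms` and the toy column names are hypothetical.

```python
import pandas as pd

MIN_COMPOUNDS = 20  # threshold described in this PR


def filter_go_terms(y_df: pd.DataFrame, min_compounds: int = MIN_COMPOUNDS) -> pd.DataFrame:
    """Keep only GO term columns annotated for at least `min_compounds` compounds.

    Assumes rows are compounds and columns are GO terms (hypothetical layout),
    with a nonzero entry meaning "this compound has this annotation".
    """
    # Count the compounds (rows) carrying each GO term (column)
    compounds_per_term = (y_df != 0).sum(axis=0)
    # Select the GO terms that meet the threshold, preserving column order
    keep = compounds_per_term[compounds_per_term >= min_compounds].index
    return y_df.loc[:, keep]


# Toy example with a lowered threshold so the effect is visible
toy = pd.DataFrame(
    {
        "go_term_a": [1, 1, 1, 0],  # 3 compounds
        "go_term_b": [1, 0, 0, 0],  # 1 compound, dropped below
        "go_term_c": [1, 1, 0, 0],  # 2 compounds
    },
    index=["cpd_1", "cpd_2", "cpd_3", "cpd_4"],
)
filtered = filter_go_terms(toy, min_compounds=2)
```

In the toy example only `go_term_a` and `go_term_c` survive, since `go_term_b` is annotated for a single compound. With the default threshold of 20, the same count-then-select step would apply to the full matrix.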
See #60 for more details.