Sage-Bionetworks / sage-monorepo

Where OpenChallenges, Schematic, and other Sage open source apps are built
https://sage-bionetworks.github.io/sage-monorepo/
Apache License 2.0

Evaluate the number of biomedical challenges on Kaggle #936

Closed tschaffter closed 1 year ago

tschaffter commented 2 years ago

Kaggle is the largest platform in terms of total number of challenges organized, so we probably want to include relevant results in the challenge registry. Based on the number of "biomedical" challenges found, we could then discuss how to include these data in the registry.

Notes

Tasks

vpchung commented 2 years ago

At first glance at the available tags, I think starting with the Research tag would be appropriate.

vpchung commented 1 year ago

As I was perusing the available data, I realized that the Kaggle community has already done some preliminary analysis for us! For example, this notebook shows that Kaggle has had 147 "Research" competitions. This seemed like a promising start; however, after some further digging, I found that "Research" in this context actually means experimental challenges, not actual research-related challenges. (source)

So then, I looked at what the available tags were -- in total, there are 725 tags :upside_down_face: I did not want to look at each of these individually, so instead, I looked for tags with "research" in the tag name, and there was only one (named research). Kaggle defined research as:

Research is our endeavor to systematically increase our knowledge about the world. Whether
it's undertaken by greats like Einstein or underpaid graduate students, you'll find the fruits
of their labor in this tag plus the kernels that make their work reproducible.

There are a total of 7 challenges with this tag (and 441 datasets!). This seemed like a small number of challenges, so I expanded the search to look at all tags with "science and technology" in their FullPath (the ontology of the tag, e.g. research is subject > science and technology > research). This resulted in 34 tags:

0                      websites
1                      research
2                search engines
110      science and technology
112               biotechnology
113                      energy
115               manufacturing
118                    robotics
119            renewable energy
120     artificial intelligence
121            computer science
122                    internet
123         mobile and wireless
124                 programming
125                    software
126                 electronics
127                 engineering
128              transportation
129    automobiles and vehicles
130                    aviation
131                     cycling
132              rail transport
133             water transport
240         email and messaging
241          online communities
242             social networks
243                 electricity
244                 oil and gas
247                         tpu
249                accelerators
250                         gpu
308                      python
309                           r
310                         sql

Of those found, only biotechnology seemed relevant to our needs. This brings the total to 8 challenges. The available metadata for these 8 challenges (as well as my logic for getting to this point) is gathered in this notebook here.
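For anyone who cannot run the notebook, the two tag filters described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the rows are made up, and the `Name` column is an assumption (only `FullPath` is mentioned above).

```python
# Hypothetical tag records for illustration only. "FullPath" is the field
# described above; "Name" is an assumed column name, and the rows are made up.
tags = [
    {"Name": "research", "FullPath": "subject > science and technology > research"},
    {"Name": "biotechnology", "FullPath": "subject > science and technology > biotechnology"},
    {"Name": "basketball", "FullPath": "subject > sports > basketball"},
]


def tags_by_name(tags, term):
    """Return tags whose name contains the term (case-insensitive)."""
    term = term.lower()
    return [t for t in tags if term in t["Name"].lower()]


def tags_by_path(tags, term):
    """Return tags whose ontology path contains the term (case-insensitive)."""
    term = term.lower()
    return [t for t in tags if term in t["FullPath"].lower()]


name_hits = [t["Name"] for t in tags_by_name(tags, "research")]
path_hits = [t["Name"] for t in tags_by_path(tags, "science and technology")]
print(name_hits)
print(path_hits)
```

Filtering on the path rather than the name is what widens the result from a single tag to the whole "science and technology" branch.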

vpchung commented 1 year ago

@tschaffter @rrchai @chepyle Let me know if you think starting with these 8 will be good enough for our closed launch. If so, I will go ahead and move forward with these challenges for Kaggle. Otherwise, we may need to continue exploring what "biomedical" means in terms of searching through Kaggle.

tschaffter commented 1 year ago

Kudos for creating this notebook!

I can't run it but I'll bug you tomorrow (today is off).

I think that the criteria used to identify research/biomedical challenges are too restrictive. For example, there are 124 Competitions for the search term "cancer".

Relevant tags include "medicine", "genetics", and "healthcare". These tags return far fewer challenges than "cancer". In total, with these tags there should be about 124-140 results. We could probably generate a list, then manually review it. Let's keep track of the challenges that we have discarded so that we can ignore them in a future review.

We should aim to have as many challenges registered in OpenChallenges as possible for the release of the private preview so that the search features we provide seem useful. We could limit the collection/ingestion of data to a subset of fields (e.g. name, description and a few others) and complete the data later once the registry has launched (e.g. name of the individual organizers).
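One possible way to remember discarded challenges between reviews is to keep their identifiers in a small list and filter on it before each manual pass. The sketch below is only an assumption about how this could look; the `id` field, the helper names, and the sample rows are all hypothetical.

```python
import json


def pending_review(candidates, discarded_ids):
    """Return candidates that have not been discarded in a previous review."""
    discarded = set(discarded_ids)
    return [c for c in candidates if c["id"] not in discarded]


def record_discard(discarded_ids, challenge_id):
    """Return a new sorted list of discarded ids including challenge_id."""
    return sorted(set(discarded_ids) | {challenge_id})


# Hypothetical candidates produced by a keyword search.
candidates = [
    {"id": 101, "name": "Example cancer challenge"},
    {"id": 102, "name": "Example helmet challenge"},
    {"id": 103, "name": "Example genetics challenge"},
]

discarded_ids = record_discard([], 102)   # reviewer rejects challenge 102
to_review = pending_review(candidates, discarded_ids)

# The list can be serialized and committed next to the notebook.
serialized = json.dumps(discarded_ids)
print([c["id"] for c in to_review], serialized)
```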

chepyle commented 1 year ago

Thanks for the notebook @vpchung ! It looks like the Kaggle-supplied tags are rubbish: some relevant challenges, like this one, have tags that are too generic. Whether we want a manually curated list is a decision I'll leave to the team, but I suggest that we try for recent (>2021) challenges from competition searches of "medicine", "cell", and "cancer". A caveat of the sitewide search is that it pulls in a lot of community competitions, which would be too small (<20 teams) to feature on OpenChallenges.
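The recency and team-size criteria could be applied as a simple filter, roughly like this. It is a sketch under assumptions: the `year` and `teams` fields and the sample rows are hypothetical, and ">2021" is read here as 2022 or later.

```python
def is_featurable(challenge, min_year=2022, min_teams=20):
    """Keep challenges from min_year onward that have at least min_teams teams."""
    return challenge["year"] >= min_year and challenge["teams"] >= min_teams


# Hypothetical search results; field names and values are made up.
results = [
    {"name": "Large recent challenge", "year": 2022, "teams": 850},
    {"name": "Small community competition", "year": 2023, "teams": 12},
    {"name": "Large older challenge", "year": 2019, "teams": 1200},
]

featurable = [c["name"] for c in results if is_featurable(c)]
print(featurable)
```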

vpchung commented 1 year ago

@tschaffter @chepyle thank you both! I took both of your suggestions and applied a "search" with the following terms:

This increased our total number of challenges to 40. This lower-than-expected number is due to Kaggle not dumping their entire database into the CSVs. I then tried using their public API, hoping that it would give us more results, but unfortunately, it only returns active challenges 🙃

I will start with these 40 for now since the metadata is already readily available. I will manually curate more if I still have time afterward.
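A keyword "search" over the competition metadata could look roughly like the sketch below. The terms are only the ones suggested earlier in this thread, not necessarily the exact list used in the notebook, and the `title` and `description` fields and sample rows are hypothetical.

```python
import re

# Terms taken from suggestions earlier in this thread; the list actually
# used in the notebook may differ.
SEARCH_TERMS = ["cancer", "medicine", "cell", "genetics", "healthcare"]


def build_matcher(terms):
    """Compile one case-insensitive pattern that matches any term as a whole word."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def matched_terms(text, matcher):
    """Return the sorted, lowercased set of terms found in the text."""
    return sorted({m.group(0).lower() for m in matcher.finditer(text)})


# Hypothetical competition rows.
competitions = [
    {"title": "Breast Cancer Detection", "description": "Classify cell images."},
    {"title": "Excellent Chess Moves", "description": "Predict the next move."},
]

matcher = build_matcher(SEARCH_TERMS)
hits = {}
for comp in competitions:
    found = matched_terms(comp["title"] + " " + comp["description"], matcher)
    if found:
        hits[comp["title"]] = found
print(hits)
```

Whole-word matching keeps "cell" from matching inside "Excellent", at the cost of missing plurals such as "cells"; a plain substring match makes the opposite trade-off.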

vpchung commented 1 year ago

Updated notebook here

tschaffter commented 1 year ago

Querying the public API was a great idea. I'm starting to put some thought into how to capture new challenges on Kaggle in #1217 to make them directly available in the OpenChallenges DB. We will still need to rely on the Kaggle archive for past challenges.

rrchai commented 1 year ago

FYI

ChatGPT: The best way to find all biomedical challenges on Kaggle is to visit the Kaggle website and navigate to the "Explore" page. Then, you can use the filter options on the left side of the page to narrow down the results to only show biomedical challenges by selecting "Health" or "Life Sciences" under the "Industry" category. Additionally, you can also search for specific keywords related to biomedical challenges in the search bar at the top of the page.

The answer is out of date since the filters no longer exist, but maybe "Health" and "Life Sciences", along with "Industry", are also good tags for finding biomedical challenges if they exist in the metadata. :)

vpchung commented 1 year ago

Thanks for the suggestion, @rrchai ! I added "life sciences" and "health" to the list of terms to search for, and was able to get 13 more challenges, bringing the total to 53.

I'll have to do some QC though, as I've noticed some that may not be biomedical, e.g. NFL Health & Safety - Helmet Assignment