Closed tschaffter closed 1 year ago
At first glance of the available tags, I think starting with the Research tag will be appropriate.
As I was perusing the available data, I realized that the Kaggle community has already done some preliminary analysis for us! For example, this notebook shows that Kaggle has had 147 "Research" competitions. This seemed like a promising start; however, after some further digging, I found that "Research" in this context actually means experimental challenges, not research-related challenges. (source)
So then, I looked at what the available tags were -- in total, there are 725 tags :upside_down_face: I did not want to look at each of these individually, so instead, I searched for tags with "research" in the tag name, and there was only one (named `research`). Kaggle defines `research` as:
> Research is our endeavor to systematically increase our knowledge about the world. Whether it's undertaken by greats like Einstein or underpaid graduate students, you'll find the fruits of their labor in this tag plus the kernels that make their work reproducible.
There are a total of 7 challenges with this tag (and 441 datasets!). This seemed like a small number of challenges, so I expanded the search to look at all tags with "science and technology" in their FullPath (the ontology of the tag, e.g. `research` is subject > science and technology > research). This resulted in 34 tags:
0 websites
1 research
2 search engines
110 science and technology
112 biotechnology
113 energy
115 manufacturing
118 robotics
119 renewable energy
120 artificial intelligence
121 computer science
122 internet
123 mobile and wireless
124 programming
125 software
126 electronics
127 engineering
128 transportation
129 automobiles and vehicles
130 aviation
131 cycling
132 rail transport
133 water transport
240 email and messaging
241 online communities
242 social networks
243 electricity
244 oil and gas
247 tpu
249 accelerators
250 gpu
308 python
309 r
310 sql
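The FullPath filter described above can be sketched with pandas. This is a minimal sketch on a toy table: the `Name` and `FullPath` column names are assumptions modeled on the Meta Kaggle `Tags.csv` schema, not a verified copy of it.

```python
import pandas as pd

# Toy stand-in for Kaggle's Tags.csv; the real file has many more rows/columns.
tags = pd.DataFrame({
    "Name": ["research", "biotechnology", "cycling", "medicine"],
    "FullPath": [
        "subject > science and technology > research",
        "subject > science and technology > biotechnology",
        "subject > science and technology > transportation > cycling",
        "subject > health and fitness > health > medicine",
    ],
})

# Keep every tag whose ontology path mentions "science and technology".
sci_tech = tags[tags["FullPath"].str.contains("science and technology", case=False)]
print(sorted(sci_tech["Name"]))  # ['biotechnology', 'cycling', 'research']
```

Note that matching on FullPath rather than Name is what pulls in child tags like `cycling` that live anywhere under the "science and technology" branch.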
Of those found, only `biotechnology` seemed relevant to our needs. This brings the total to 8 challenges. The available metadata of these 8 challenges (as well as my logic for getting to this point) are gathered in this notebook here.
@tschaffter @rrchai @chepyle Let me know if you think starting with these 8 will be good enough for our closed launch. If so, I will go ahead and move forward with these challenges for Kaggle. Otherwise, we may need to continue exploring what "biomedical" means in terms of searching through Kaggle.
Kudos for creating this notebook!
I can't run it but I'll bug you tomorrow (today is off).
I think that the criteria used to identify research/biomedical challenges are too restrictive. For example, there are 124 competitions for the search term "cancer".
Relevant tags include "medicine", "genetics", and "healthcare". These tags return far fewer challenges than "cancer". In total, with these tags, there should be about 124-140 results. We could generate a list, then manually review it. Let's keep track of the challenges we discard so that we can ignore them in a future review.
We should aim to have as many challenges registered in OpenChallenges as possible for the release of the private preview so that the search features we provide seem useful. We could limit the collection/ingestion of data to a subset of fields (e.g. name, description, and a few others) and complete the data later once the registry has launched (e.g. names of the individual organizers).
Thanks for the notebook @vpchung ! It looks like the Kaggle-supplied tags are rubbish; some relevant challenges like this one have tags that are too generic. If we want a manually curated list, I'll leave the decision to the team, but I suggest that we try for recent (>2021) challenges from competition searches of "medicine", "cell", and "cancer". A caveat of the sitewide search is that it pulls in a lot of community competitions, which would be too small (<20 teams) to feature on OpenChallenges.
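The recency and team-size cutoffs suggested here could be applied like the sketch below. The `DeadlineDate` and `TotalTeams` column names are assumptions modeled on the Meta Kaggle competitions dump, and the rows are invented for illustration.

```python
import pandas as pd

# Toy competitions table; a real Meta Kaggle dump has many more fields.
comps = pd.DataFrame({
    "Title": ["Recent Imaging Challenge", "Old Cancer Challenge", "Tiny Community Comp"],
    "DeadlineDate": ["2022-10-27", "2017-06-12", "2022-03-01"],
    "TotalTeams": [883, 520, 12],
})
comps["DeadlineDate"] = pd.to_datetime(comps["DeadlineDate"])

# Keep only recent (after 2021), reasonably sized (>= 20 teams) competitions.
keep = comps[(comps["DeadlineDate"].dt.year > 2021) & (comps["TotalTeams"] >= 20)]
print(list(keep["Title"]))  # ['Recent Imaging Challenge']
```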
@tschaffter @chepyle thank you both! I took both of your suggestions and applied a "search" with the following terms:
This increased our total number of challenges to 40. This lower-than-expected number is due to Kaggle not including their entire database in the CSV dumps. I then tried using their public API, hoping that would give us more results, but unfortunately, it only returns active challenges 🙃
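Combining several search terms over the CSV dumps can be sketched as a single regex match over competition titles, deduplicated so a title matching multiple terms is counted once. The `Id`/`Title` columns and all rows here are illustrative assumptions, not actual Kaggle data.

```python
import pandas as pd

# Toy competitions table with hypothetical entries.
comps = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Title": [
        "Histopathologic Cancer Detection",
        "Personalized Medicine: Redefining Cancer Treatment",
        "Single Cell Classification",
        "Titanic - Machine Learning from Disaster",
    ],
})

terms = ["medicine", "cell", "cancer"]
pattern = "|".join(terms)  # regex alternation: match any of the terms
hits = comps[comps["Title"].str.contains(pattern, case=False)]
hits = hits.drop_duplicates(subset="Id")  # count each competition once
print(sorted(hits["Id"]))  # [1, 2, 3]
```

Row 2 matches both "medicine" and "cancer" but appears only once in the result, which is why the per-term counts don't simply add up.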
I will start with these 40 for now since the metadata is already readily available. I will manually curate more if I still have time afterward.
Querying the public API was a great idea. I'm starting to put some thoughts into how to capture new challenges on Kaggle in #1217 to make them directly available in the OpenChallenges DB. We will still need to rely on the Kaggle archive for past challenges.
FYI
ChatGPT: The best way to find all biomedical challenges on Kaggle is to visit the Kaggle website and navigate to the "Explore" page. Then, you can use the filter options on the left side of the page to narrow down the results to only show biomedical challenges by selecting "Health" or "Life Sciences" under the "Industry" category. Additionally, you can also search for specific keywords related to biomedical challenges in the search bar at the top of the page.
The answer is out of date since those filters no longer exist, but "Health" and "Life Sciences" (along with "Industry") may also be good tags for finding biomedical challenges, if they exist in the metadata. :)
Thanks for the suggestion, @rrchai ! I added "life sciences" and "health" to the list of search terms and was able to get 13 more challenges, bringing the total to 53.
I'll have to do some QC though, as I've noticed some that may not be biomedical, e.g. NFL Health & Safety - Helmet Assignment
Kaggle is the largest platform in terms of total number of challenges organized, so we probably want to include relevant results in the challenge registry. Based on the number of "biomedical" challenges found, we could then discuss how to include these data in the registry.