StevenBlack / hosts

🔒 Consolidating and extending hosts files from several well-curated sources. Optionally pick extensions for porn, social media, and other categories.

Proposal: Category "fake science" / malicious journals #720

Open pascalwhoop opened 6 years ago

pascalwhoop commented 6 years ago

This project mirrors a list that was taken down recently. The idea is to help researchers be wary of fake journals. It could be very useful for academic institutions that want to make sure their researchers aren't tricked into publishing in such fake journals. Obviously this requires good crowdsourcing to ensure the listed domains are actually fake and not just legitimate but small journals. I'm kicking off a discussion to see what others think.

I think this repo is a great place to embed this into: you have the reputation, the experience, and the toolchain to manage such a list efficiently and publicly. A recent study by my university found that 5% of all German researchers have been tricked at least once, and that several thousand researchers worldwide have been fooled by these journals.

I'd be happy to turn that linked list into an initial hosts file, but I'd like to somehow make sure that these are actually all fake, and I am not yet sure how that could easily be achieved.

welcome[bot] commented 6 years ago

Hello! Thank you for opening your first issue in this repo. It’s people like you who make these host files better!

StevenBlack commented 6 years ago

Hi @pascalwhoop that's a very interesting idea. Thanks!

katrinleinweber commented 6 years ago

Nice :-) Should the list be maintained here then, or over at @stop-predatory-journals?

I wonder whether it would be possible to maintain the journals & publishers lists as machine-readable hosts files (or two), and to auto-generate the website from them?

pascalwhoop commented 6 years ago

@katrinleinweber you can definitely generate the lists (hosts -> csv -> website) automatically, as long as the base list follows some strict pattern; the rest can be done with grep and sed. I wrote a script that did most of the work from the CSV files to hosts files, but the CSV files are a bit messy, so I didn't continue.

More importantly, how do we ensure that these lists are "true"? I imagine there is a gradient between predatory journals and just really unpopular / unimportant ones. What would be a good "in or out" criterion? Alternatively, we could have 3 categories with increasing levels of "probably evil". Universities could then manage these and handle them differently: a yellow warning, a red warning, and finally a complete block of the host from within their network.
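A minimal sketch of that csv -> hosts step, assuming the source CSV has a column holding the journal URL (the column name `url` and the `0.0.0.0` target are assumptions, not the actual journals.csv layout):

```python
# Sketch: turn a CSV of journal URLs into hosts-file entries.
# The "url" column name is an assumption; adjust to the real CSV layout.
import csv
from urllib.parse import urlparse

def csv_to_hosts(csv_path, url_column="url", target_ip="0.0.0.0"):
    hosts = set()
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            raw = (row.get(url_column) or "").strip()
            if not raw:
                continue  # row has no URL at all
            if "//" not in raw:
                raw = "http://" + raw  # urlparse needs a scheme to find the host
            host = urlparse(raw).hostname
            if host:
                hosts.add(host)
    return "\n".join(f"{target_ip} {h}" for h in sorted(hosts))

if __name__ == "__main__":
    print(csv_to_hosts("journals.csv"))
```

The deduplicated, sorted output could then feed both the hosts extension and the website generator.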

I will contact my university's network administrators and see what they have set up in terms of infrastructure. A hosts file is a good start for plain DNS blocking, but there may be other ways that are a bit more complex but gentler, like the "this is malware, continue anyway?" page that Chrome sometimes displays. One could have an internally hosted application that says "this is a bad journal known to trick people, continue anyway?" and, if the researcher confirms, forwards them to the actual website.
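A very rough way to approximate that on top of a hosts file (the IP and domain below are made up for illustration): instead of blackholing a domain with 0.0.0.0, point it at an internal web server that serves the warning page.

```
# Standard blackhole entry, as used in this repo:
0.0.0.0 predatory-journal.example

# Hypothetical alternative: point at an internal warning server instead:
10.0.0.53 predatory-journal.example
```

This is hand-wavy, though: HTTPS would trigger certificate warnings, and the click-through would need a proxy or similar to reach the real site past the override, so a proper solution probably lives in the network's DNS or proxy layer rather than in a plain hosts file.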

katrinleinweber commented 6 years ago

> More importantly, how do we ensure that these lists are "true"?

In whatever way @stop-predatory-journals is currently using; see https://github.com/stop-predatory-journals/stop-predatory-journals.github.io/pull/1#discussion_r160742358 for an example. That's why I think a hosts file should also either be maintained there, or be auto-generated from their source.

What exactly is wrong with their CSV files? I imagine they can be cleaned up so that they can be run through sed automatically.

pascalwhoop commented 6 years ago

I was having trouble catching this line, for example: https://github.com/stop-predatory-journals/stop-predatory-journals.github.io/blob/master/_data/journals.csv#L361

Also lines 404 and 477.
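For what it's worth, a quick way to flag rows whose URL field doesn't yield a usable hostname (same hypothetical `url` column as in the sketch above), so they can be fixed by hand before conversion:

```python
# Sketch: report CSV rows whose URL field doesn't parse to a hostname.
# The "url" column name is an assumption; line numbers are approximate
# (they don't account for quoted multi-line fields).
import csv
from urllib.parse import urlparse

def find_bad_rows(csv_path, url_column="url"):
    bad = []
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for lineno, row in enumerate(csv.DictReader(fh), start=2):  # header is line 1
            raw = (row.get(url_column) or "").strip()
            if "//" not in raw:
                raw = "http://" + raw
            if not urlparse(raw).hostname:
                bad.append((lineno, row))
    return bad

for lineno, row in find_bad_rows("journals.csv"):
    print(lineno, row)
```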

spirillen commented 5 years ago

After reading up on this project's goal, I must admit it's a good idea, but how would you tell the merely greedy sites apart from the hoax sites?

As I understand this repo, it isn't aimed at the merely greedy, otherwise github.com (Microsoft) would have been added to the hosts file already.