billimarie / prosecutor-database

An open-source, community oversight dataset of all U.S. Prosecutors. Happy Hacktoberfest 🎃
https://billimarie.github.io/prosecutor-database

[APP] Force SSL #130

Open billimarie opened 3 years ago

billimarie commented 3 years ago

A lot of the images we are hotlinking to are not served over https. Is there a method we can use to force https?
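
One naive approach might be to rewrite the scheme when we ingest each record. A rough Python sketch (untested; force_https is just an illustrative name):

from urllib.parse import urlsplit, urlunsplit

def force_https(url):
    # Rewrite an http:// URL to https://; leave other schemes untouched.
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)

Of course, this only helps if the host actually serves the image over https.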

michaelknowles commented 3 years ago

I would advise that we not hotlink to other websites' images. This steals their bandwidth and opens the website up to abuse. For example, if we are using another website's image, they could change that image to whatever they want (e.g., something inappropriate) while keeping the same URL, so we would unknowingly show it on our website.

Instead, we should be uploading our own versions/copies of images and using them.

billimarie commented 3 years ago

Hi @michaelknowles, I agree. A few years back, we used Sirv for image hosting. The only problem was that it slowed down our data collection, since we had to save, upload, & tag each image.

We then shifted to a model where images were stored on GitHub in a folder called headshots. That was a little faster, but again, the data import was slowed by the need to organize the images.

For this year's Hacktoberfest, if you'd like to take the lead on steering us toward a sustainable image hosting solution, I'd love to assign you an updated issue. What are your thoughts?

michaelknowles commented 3 years ago

It looks like people are just supplying links to images in the JSON.

{
  ...
  "headshot":"https://www.pdaa.org/wp-content/uploads/2019/07/adamsco.jpg",
  ...
}

Can you explain how this JSON is then getting uploaded into the database? Ideally, we'd have a script that is uploading this data. The same script would fetch the linked image, transform it, then store it somewhere.
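
For the image half, here is a minimal sketch of the fetch-and-store step, assuming we keep our own copies in a local headshots/ directory (the naming scheme and paths are illustrative):

import hashlib
import os
import requests

def fetch_headshot(url, out_dir="headshots"):
    # Download the linked image so we serve our own copy instead of hotlinking.
    os.makedirs(out_dir, exist_ok=True)
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    # Name the file by a hash of the source URL so re-imports are idempotent.
    ext = os.path.splitext(url)[1] or ".jpg"
    name = hashlib.sha256(url.encode()).hexdigest()[:16] + ext
    path = os.path.join(out_dir, name)
    with open(path, "wb") as f:
        f.write(resp.content)
    return path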

As for where to store the images, there are a couple of options: committing our own copies to the repo (as with the old headshots folder), or using an external image host (as with Sirv).

billimarie commented 3 years ago

It sounds like writing that kind of script is the first step. An older scraper I wrote might be tweakable; it extracted the necessary data from local prosecutor websites. You can find it here: https://github.com/billimarie/prosecutor-database/blob/501ef012324d3d11f520bb9aeeb334beb32f4278/README.md#optional-python-script
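
For a sense of the general shape such a scraper takes (a hedged sketch, not the linked script; the selectors are placeholders, since every office's site is laid out differently):

import requests
from bs4 import BeautifulSoup

def scrape_prosecutor(page_url):
    # Pull a name and headshot URL from a single prosecutor page.
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    name_tag = soup.select_one("h1")   # placeholder selector
    img_tag = soup.select_one("img")   # placeholder selector
    return {
        "name": name_tag.get_text(strip=True) if name_tag else None,
        "headshot": img_tag.get("src") if img_tag else None,
    }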

Your work might overlap with that of @janel-developer, who is researching alternative data sources. You can check out that issue at #145 in case you are able to assist.

Currently, I manually review & import the data via Terminal. It is entirely possible to create a script that first sanitizes the data, then imports it to MongoDB; the hard part (which we have attempted numerous times without success) is creating a scraper that uniformly grabs the data from multiple sources in the first place.
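
As a rough sketch of what that sanitize-then-import script could look like, assuming a local MongoDB instance (the database, collection, & field names are illustrative):

import json
from pymongo import MongoClient

def import_prosecutors(json_path, mongo_uri="mongodb://localhost:27017"):
    # Load a JSON file of records, sanitize them, then insert into MongoDB.
    with open(json_path) as f:
        records = json.load(f)
    clean = [
        {k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
        for r in records
        if r.get("name")  # drop records missing a required field
    ]
    client = MongoClient(mongo_uri)
    client["prosecutors"]["records"].insert_many(clean)
    return len(clean)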

Are you interested in tackling any of these issues? If so, we can create a new issue for you. There is data within the 100 DA folder which you can experiment with, & notes in the DOCS.md on how to spin up a MongoDB instance.

michaelknowles commented 3 years ago

We can keep this as two separate scripts:

- one that fetches and transforms the linked images
- one that sanitizes the JSON data and imports it into MongoDB

That way we can work in parallel, and it will also ease development of unit tests.

billimarie commented 3 years ago

Sounds good! Which would you like to work on first? Feel free to create an issue if you have time.

michaelknowles commented 3 years ago

Let's see what discussion happens on that other issue first. I don't want to create duplicate or conflicting work.

billimarie commented 3 years ago

@michaelknowles That issue is not for creating scripts; it is for researching data sources.

michaelknowles commented 3 years ago

Ah, got it. In that case, I can work on the upload script first. I'm assuming the data will be stored in the same JSON format.