swarmee opened this issue 7 years ago
@swarmee this looks really interesting. Can you confirm the license on the datasets?
Hi @swarmee, looking at the datasets it's unlikely there are licensing issues. They are, however, very small. Would it maybe make sense to consolidate them into a single example, i.e. a script that loads all of them? This might make for an interesting example.
@asawariS @jamiesmith thoughts? It also might make sense to simplify the scripts and use ingest node to parse the contents.
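For illustration, an ingest pipeline could do the CSV parsing that the Logstash configs do today - this is just a sketch, the pipeline name and field names are made up:

# Sketch only: define an ingest pipeline that groks a raw CSV line into fields
# (pipeline name and field names are invented for illustration).
curl -XPUT "localhost:9200/_ingest/pipeline/fastest-humans" -H 'Content-Type: application/json' -d '
{
  "description": "parse a csv row into athlete, country and time fields",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{DATA:athlete},%{DATA:country},%{NUMBER:time:float}"]
      }
    }
  ]
}'

# Documents indexed with ?pipeline=fastest-humans then get parsed on the way in, e.g.
curl -XPOST "localhost:9200/sprints/record?pipeline=fastest-humans" -H 'Content-Type: application/json' -d '{"message":"Usain Bolt,Jamaica,9.58"}'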
Hey Dale, Thanks for the feedback,
Most of the raw data comes straight from Wikipedia - using good old Google Sheets "IMPORTHTML". Looking at the Wikimedia Foundation terms of use, it seems pretty clear that provided any examples created carry the same Wikipedia licensing (i.e. "CC BY-SA" and "GFDL") and include attribution (a link to the original webpage, which is included in a markdown visualisation on the dashboard), the licensing of the raw data is covered off.
In relation to licensing of the OpenStreetMap geo-coordinates, which are obtained at the time of running the Logstash scripts - that data is provided under the ODbL 1.0 licence, which clearly says that you are free to copy, distribute, transmit and adapt the data, as long as you credit OpenStreetMap and its contributors (which is also covered off by a markdown visualisation on the dashboard).
There are a few other essentially free data sources in the repository, including OpenAddresses for Australian locations, MaxMind's free world cities database and Australia's Bureau of Meteorology. The only one which is a bit sketchy is the Australian stock market data sourced from www.asxhistoricaldata.com.
Yes, most of the datasets are super small - which is good in some ways (i.e. the data loads fast). It all depends on what you are trying to provide an example of.
If you want to show people how they can use Logstash to get data into Elasticsearch easily (from CSV), or how pretty a dashboard you can create in Kibana, it probably does not matter how big the dataset is. However, if you want people to actually use the example to demonstrate fast search (the key feature of Elasticsearch), then obviously you want a large dataset. The OpenAddresses example has around 14 million geo-coded Australian addresses in it, if you want an example like that - here's a screenshot of that dashboard (http://www.swarmee.net/images/pic05.jpg).
Having a script to load all the small Wikipedia-based data sources would be super easy. I would probably want to load the Kibana index configuration as part of the process as well, so users could go straight into the dashboards.
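On the 5.x stack the Kibana objects are just documents in the .kibana index, so the loader could push them in with curl too - rough sketch only, the file and object names here are made up:

# Sketch: index previously exported Kibana objects straight into the .kibana index.
# The 5.x .kibana index uses document types like dashboard, visualization, search and index-pattern.
curl -XPUT "localhost:9200/.kibana/dashboard/top-tallest-mountains" -H 'Content-Type: application/json' -d @kibana/top-tallest-mountains-dashboard.json
curl -XPUT "localhost:9200/.kibana/index-pattern/top-tallest-mountains" -H 'Content-Type: application/json' -d @kibana/top-tallest-mountains-index-pattern.json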
In relation to ingest configuration - that's not something I have great familiarity with. Historically I have found it much easier to log in JSON, load the data straight into Elasticsearch, and, if we really have to, use a scripted field to parse a particular field's data.
Let me know what you are thinking...
I only thumbed through a couple of the scripts. It seems like you could have a script at the top level of the hierarchy, something like "dataset-loader.sh", specify the folders that you want to load, and have it figure out the rest.
Something like: ./dataset-loader.sh top-sellings-books
Note that the scripts really need a shebang; currently they assume that you are running bash. Add this:
#!/bin/bash
You should also include a way to pass in username and password (or prompt for the password)
Script would be something like:
#!/bin/bash
HOST="localhost"
PORT=9200
USER=""
PASSWORD=""
CREDS=""
function die_usage
{
echo "$*"
cat <<EOF
usage: [-h hostname][-p port][-u username][-s password]
-h host...
...
EOF
exit 9
}
while getopts "h:p:s:u:" option
do
case $option in
h)
HOST="$OPTARG"
;;
p)
PORT="$OPTARG"
;;
s)
PASSWORD="$OPTARG"
;;
u)
USER="$OPTARG"
;;
*)
die_usage "Unsupported argument"
;;
esac
done
shift $((OPTIND - 1))
if [ -n "$USER" ] && [ -z "$PASSWORD" ]
then
# Prompt for the password without echoing it
read -s -p "Password for ${USER}: " PASSWORD
echo
fi
for directory in "$@"
do
savedir="$(pwd)"
cd "$directory" || continue
# Then do the load (see the sketch below). You would need to genericize this,
# and do a for file in *.gz
cd "$savedir"
done
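The "do the load" part inside that loop would be roughly along these lines - a sketch only, the mapping/data file names and bulk layout are whatever your repo actually uses:

# Sketch: create the index with its mapping, then bulk load the gzipped data files.
# Assumes <index>-mapping.json holds the index settings/mappings and the .json.gz files
# are already in bulk (NDJSON) format. $CREDS stays empty for an unsecured cluster,
# or holds "--user user:pass" otherwise.
for file in *.json.gz
do
index="${file%%.*}"
curl $CREDS -XPUT "${HOST}:${PORT}/${index}" -H 'Content-Type: application/json' -d @"${index}-mapping.json"
gunzip -c "$file" | curl $CREDS -XPOST "${HOST}:${PORT}/_bulk" -H 'Content-Type: application/x-ndjson' --data-binary @-
done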
Hey @jamiesmith, yes they are pretty much all the same structure
Thanks for the advice re the wrapper script. I'll play around on the weekend to take advantage of your suggestions.
Sorry for not being that good :)
I am more than happy to help you flesh out the script. One of my skills is that I am lazy, so I script everything ;). Give it a shot. It is useful to just throw an "echo" in front of commands, that way it is non-destructive.
Note you will also want to set the credentials in that if/user block, CREDS="--user ${USER}:${PASSWORD}"; then when you do the curl you just add $CREDS, e.g. curl $CREDS ...
There are likely edge cases that are tricky but we can get past those.
Hey @gingerwizard I noticed that you are in Portugal (or at least that's what LinkedIn says) and from what I could see Portugal has a pretty good OpenAddresses dataset. So I copied my Australian OpenAddresses loader, tweaked it for Portugal and added it to the repo. It looks like this: http://www.swarmee.net/images/openaddresses-portugal-dashboard.png
Hey @jamiesmith
I had a crack at the bash script, but as the datasets were so small I thought I would just fall back to posting the data in using curl (it handles passwords much better than I could).
I have updated the below three examples to just use curl to put in the mapping and post in the data. I have also updated the READMEs - please take a look.
https://github.com/swarmee/swarmee.datasets/tree/master/fastest-humans-over-100m
https://github.com/swarmee/swarmee.datasets/tree/master/highest-grossing-animated-films
https://github.com/swarmee/swarmee.datasets/tree/master/top-tallest-mountains
I actually got diverted creating some test data and doing a write-up on data modelling in Elasticsearch - it's over here. I'm sure somebody has already written this up somewhere, however I can see the benefit, just for me, of having it written down. :)
You would still need to use curl to post the data - that is the "then do the load" part of the script that I posted. All of the stuff from the getopts would simply be building arguments for you to pass off to the curl command.
At the very least, every script should start with a #!/bin/bash so it knows what interpreter to use.
Note that it appears the write up that you refer to is in a private repo.
Sorry Jamie, I have been sucked into some other stuff lately. I am sure I can get this 'getopts' thing working, I just need a little time. I have renamed that other write-up I was working on - I'm pretty happy with it because it documents most of the stuff I have learnt about Elasticsearch. https://github.com/swarmee/partySearch
I wonder if the below is sufficient to post in all of the files in a directory.
echo "#####################################################" echo "The below parameters are required to post data into your cluster, if you are running on your localhost with no security you can just go with the defaults" echo "#####################################################"
read -p "Username (default:elastic): " CURLUSER read -p "Password (default:changme): " CURLPASS read -p "Host (default:localhost): " CURLHOST read -p "Port (default:9200): " CURLPORT
: ${CURLHOST:="localhost"} : ${CURLPORT:="9200"} : ${CURLUSER:="elastic"} : ${CURLPASS:="changeme"}
for file in ./*.dsl do echo "curl -H 'Content-Type: application/json' -XPOST -user "${CURLUSER}":"${CURLPASS}" "${CURLHOST}":"${CURLPORT}"/_search -d @"${file}"" curl -H 'Content-Type: application/json' -XPOST -user "${CURLUSER}":"${CURLPASS}" ""${CURLHOST}":"${CURLPORT}"/_search?pretty" -d @"${file}" done
No worries...
I think that you have to take into consideration that they might not be using a secured cluster
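e.g. default the username to empty and only add the --user option when one was actually entered - a rough sketch using your variable names (CURLCREDS is a new variable I've made up here):

# Only pass credentials to curl when a username was provided
read -p "Username (leave blank for an unsecured cluster): " CURLUSER
CURLCREDS=""
if [ -n "${CURLUSER}" ]
then
read -s -p "Password: " CURLPASS
echo
CURLCREDS="--user ${CURLUSER}:${CURLPASS}"
fi

for file in ./*.dsl
do
# ${CURLCREDS} is deliberately unquoted so it disappears when empty
curl ${CURLCREDS} -H 'Content-Type: application/json' -XPOST "${CURLHOST}:${CURLPORT}/_search?pretty" -d @"${file}"
done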
Heya :),
So I have been creating some scripts to load and view sample datasets over here --> https://github.com/swarmee/swarmee.datasets
Basically they take data from CSV all the way through to being visualised in Kibana.
I have cleaned up the script and README and checked the process on 5.5 for one of the examples here --> https://github.com/swarmee/swarmee.datasets/tree/master/fastest-humans-over-100m --> I wonder if that kind of information would be useful here. The other examples are pretty good, they just need more clean-up if they are deemed to be useful.
Plenty of screenshots here of the other datasets --> http://www.swarmee.net/datasets.html
For me, picking up the Elastic Stack at my real job over the last year, the hardest thing was finding good Logstash configuration examples to follow (particularly for non-logging use cases). To me these examples illustrate best practice when building Logstash configurations, e.g. piping the input data into Logstash, using a mapping template file, and being able to create multiple versions of the underlying data but present them through an alias.
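On that last point, the alias trick means you can load a new version of an index and flip the alias across in a single atomic call without the dashboards noticing - the index names below are just illustrative:

# Repoint the alias from the old version of the index to the new one in one call
curl -XPOST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d '
{
  "actions": [
    { "remove": { "index": "top-tallest-mountains-v1", "alias": "top-tallest-mountains" } },
    { "add":    { "index": "top-tallest-mountains-v2", "alias": "top-tallest-mountains" } }
  ]
}'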