Training dataset - Githubissues

redhog commented 8 years ago

[x] Decide on location / hosting of the training dataset (github+LFS?)
- Source (non-anonymized) dataset
- Anonymized dataset
[x] Collect all the training data we have in different systems into one place
- [x] vessel-scoring point-by-point data (@redhog)
- [x] Time ranges data (@tim) from
  - [x] Kristina
  - [x] Alex
  - [x] DavidK
- [x] Trackpoints for the time ranges above (from Orbcomm) (@redhog)
- [x] Script to download trackpoints for chosen mmsis (from Orbcomm) (@redhog)
- [x] Script to anonymize time ranges (@tim)
- [x] Script to merge time ranges and tracks (@tim)
- [x] Transfer Kristina's labels over to Orbcomm tracks using the time ranges and test training quality
[x] Script to anonymize the data
- https://github.com/GlobalFishingWatch/training-data-source/blob/master/anonymize.py
[x] Add license file
- CC
- See https://docs.google.com/document/d/1VltrNz4jp2-5faGqZ_7u8uXcGFws-FxlJVcteyUjWZA/edit
[x] Finish the track download script
[x] Run the track download script and upload to LFS
[x] Change anonymize script to anonymize mmsis in filenames for per-mmsi track files
[x] Change merge script to use per-mmsi track files
[x] Anonymize new tracks
[x] Change the vessel-scoring library (@redhog)
- https://github.com/GlobalFishingWatch/vessel-scoring/tree/anonymous-training-data
- [x] to use this new training dataset location
- [x] link to the new dataset from documentation
[ ] Change NN code (@tim @seacourtaw )
- [ ] to use this new training dataset location
- [ ] link to the new dataset from documentation
[x] Compare training accuracy/precision for vessel-scoring library between the two datasets
- AUC was terrible
- [x] Try using David K's crowd sourced data and see if we can get the AUC up that way.
  - https://github.com/GlobalFishingWatch/vessel-scoring/blob/anonymous-training-data/notebooks/Model-Descriptions.ipynb
  - https://github.com/GlobalFishingWatch/vessel-scoring/blob/master/notebooks/Model-Descriptions.ipynb
[ ] Review anonymous dataset (@paul or @davidkroodsma )
[ ] Publish dataset

redhog commented 8 years ago

I just added you, @KristinaBoerder, so that you can see what's going on. The plan is to publish a comprehensive training dataset for our open source machine learning framework. We have permission to publish anonymized tracks (where the mmsi has been masked). We're planning to publish the labels from Alex's and David's crowdsourcing apps, as well as yours if that's ok. In the case of your labels, I guess we'll have to transfer them to Orbcomm data. However, that considerably degrades our performance (recall / precision). Would it be possible for you to contact ExactEarth and try to negotiate a license to release a labeled, anonymized training dataset derived from their data?

redhog commented 8 years ago

Some notes:

I have renamed
- transit-MMSI.npz to transit_N.npz for integer N
- alex/classified-filtered.npz to alex.npz
- id_fishing_points_XXX.npz to davidk_XXX.npz
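
For anyone repeating the transit-MMSI.npz to transit_N.npz renaming on new files, here is a minimal sketch of the idea. It is not the code in anonymize.py; the function names and the filename pattern are assumptions made for illustration, and assigning N in sorted-MMSI order is only a choice to keep the example deterministic.

```python
import re
from pathlib import Path

# Hypothetical pattern: per-vessel files named transit-<MMSI>.npz.
MMSI_FILE = re.compile(r"^transit-(\d+)\.npz$")


def plan_renames(filenames):
    """Map each transit-<MMSI>.npz name to transit_<N>.npz.

    N is assigned in sorted-MMSI order. That is an assumption of this
    sketch, chosen so the result is reproducible.
    """
    matches = {}
    for name in filenames:
        m = MMSI_FILE.match(name)
        if m:
            matches[name] = int(m.group(1))
    index = {mmsi: n for n, mmsi in enumerate(sorted(set(matches.values())))}
    return {name: f"transit_{index[mmsi]}.npz" for name, mmsi in matches.items()}


def apply_renames(directory, renames):
    """Rename files on disk according to a plan from plan_renames()."""
    for old, new in renames.items():
        Path(directory, old).rename(Path(directory, new))
```

Note that a sorted mapping still leaks the relative order of the MMSIs, so a real anonymizer would probably want a shuffled or keyed mapping instead.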

KristinaBoerder commented 8 years ago

Hi Egil - you are right, I do not have permission to publish the data and there is little chance ExactEarth would agree to that. I haven't yet seen what it looks like if you transfer my labels to Orbcomm data; maybe you can do that for one of the tracks and I can check. I expect rather poor results, though.

redhog commented 8 years ago

@KristinaBoerder The problem is that at the beginning of the time range the Orbcomm data is really sparse, so the prediction quality from the features we have developed is really low. It's still worth doing, but it's far less useful than the original data. Either Alex or Tim has already done this for all of your tracks and measured our precision/recall, and it wasn't very good at all.

Is there anything else you think we could/should do to improve the situation and our dataset?

Our goal is to allow outside users of our machine learning models to improve them and contribute that back, as well as develop new models on top of the same data and compare them.

redhog commented 8 years ago

Conclusions from my chat with Kristina:

- Getting permission to publish EA data is virtually impossible
- Her labels transferred to Orbcomm data might be so bad that they're unusable, and in that case shouldn't be included
- She would like to review the Orbcomm tracks with her labels

Conclusions from talks with Alex and Tim:

- We will do a sprint on transferring data from the TF NN framework to this repo when we're all in Oxford
- We should store tracks and ranges separately
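
If tracks and ranges are stored separately, they have to be joined again before training. A minimal sketch of that join for a single vessel, assuming each range is a (start, end, is_fishing) tuple and that ranges do not overlap. The function name, the tuple layout and the `UNLABELED` placeholder are inventions for this example and not the merge script's actual format.

```python
from bisect import bisect_right

UNLABELED = None  # placeholder chosen for this sketch, not the repo's encoding


def label_track(timestamps, ranges):
    """Attach a label to each track point from labeled time ranges.

    timestamps: point times for one vessel (e.g. epoch seconds).
    ranges: (start, end, is_fishing) tuples for the same vessel, assumed
            non-overlapping; a point is labeled when start <= t <= end.
    """
    ranges = sorted(ranges, key=lambda r: r[0])
    starts = [r[0] for r in ranges]
    labels = []
    for t in timestamps:
        i = bisect_right(starts, t) - 1  # last range starting at or before t
        if i >= 0 and t <= ranges[i][1]:
            labels.append(ranges[i][2])
        else:
            labels.append(UNLABELED)
    return labels
```

With sparse tracks only a few points fall inside each range, which is one reason labels transferred this way can end up much less useful than the originals.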

redhog commented 7 years ago

I have managed to merge Alex's data with David's and Kristina's (on Orbcomm), that is, all the anonymous data, and used all of it to train and test the logistic model. Here are the results:

https://github.com/GlobalFishingWatch/vessel-scoring/blob/anonymous-training-data/notebooks/Model-Descriptions.ipynb

which can be compared to

https://github.com/GlobalFishingWatch/vessel-scoring/blob/master/notebooks/Model-Descriptions.ipynb

As can be seen, the model trained on Kristina's ExactEarth data is still better, but with all this data the difference isn't very big any more (e.g. AUC=0.904 vs 0.965 for trawlers multiwindow), except for longliners, which is fairly strange...
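
For readers unfamiliar with the metric quoted above: AUC is the probability that a randomly chosen fishing point gets a higher score than a randomly chosen non-fishing point. The sketch below computes it directly from that definition. It is a generic illustration, not the notebook's code, and the function name and the 0/1 label encoding are assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for positive (fishing), 0 for negative (assumed encoding).
    scores: model outputs, higher meaning "more likely positive".
    Ties between a positive and a negative score count as 0.5.
    Runs in O(P * N), fine for a sketch but slow for large datasets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For example, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` gives 0.75, since three of the four positive/negative pairs are ranked correctly.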

So, what do you think the plan should be from here?

ping @bitsofbits @pwoods25443 @seacourtaw

redhog commented 7 years ago

[x] Write a data-dev post about the difference between Kristina's data and David's/Alex's, about training on them, and about how measurements can't really be compared between them (different labeling criteria, different quality of labeling, different selection of vessels) ( @redhog )
- https://github.com/GlobalFishingWatch/data-dev/blob/master/incomparable-training/Training%20across%20Kristinas%20and%20Davids%20data.md
[ ] Train logistic regression on only David's data and get metrics ( @bitsofbits )
[ ] Make a tileset from training with only David's data & false positives ( @redhog )
- Depends on https://github.com/SkyTruth/benthos-pipeline/issues/832

redhog commented 7 years ago

Blocked by https://github.com/SkyTruth/benthos-pipeline/issues/601

cspoh commented 6 years ago

Wondering where I can find classified-filtered.measures.npz and slow-transits.measures.npz? They are not in https://github.com/GlobalFishingWatch/training-data

The npz data in https://github.com/GlobalFishingWatch/training-data has a column 'is_fishing', while the vessel_scoring code is looking for a column 'classification'. How do I get the 'classification' column, or do I simply rename 'is_fishing' to 'classification'?

Thanks in advance for any advice.
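
If a plain rename turns out to be what is needed, it could look roughly like the sketch below. This thread does not confirm that 'is_fishing' and 'classification' carry the same meaning, so treat this only as a mechanical illustration. The archive key "x" and the function name are assumptions; check `np.load(path).files` for the real key.

```python
import numpy as np
from numpy.lib import recfunctions as rfn


def rename_label_field(in_path, out_path, key="x",
                       old="is_fishing", new="classification"):
    """Copy an .npz file, renaming one field of its structured array.

    key is the array name inside the archive; "x" is an assumption.
    Returns the field names of the array that was written.
    """
    with np.load(in_path) as archive:
        data = archive[key]
    if old not in (data.dtype.names or ()):
        raise KeyError(f"field {old!r} not found; fields are {data.dtype.names}")
    renamed = rfn.rename_fields(data, {old: new})
    np.savez(out_path, **{key: renamed})
    return renamed.dtype.names
```

Writing to a new file instead of overwriting the original keeps the published dataset intact in case the rename is not the right fix.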

redhog commented 6 years ago

Hi @cspoh ! You need to use the branch https://github.com/GlobalFishingWatch/vessel-scoring/tree/anonymous-training-data of this repository to be able to train using the anonymized published training dataset.

This is due to differences in paths, and some training data not being published (see above, regarding the data from @KristinaBoerder ).

cspoh commented 6 years ago

@redhog I got the anonymized version working. Thank you.

I noticed that recall on the anonymized branch is poor compared to the master branch. I suppose this is due to the differences in the training data?

Once again thanks for the great work.

redhog commented 6 years ago

@cspoh yeah that's why, unfortunately.

What you do have in the anonymized data is the same labels, but transplanted onto tracks of the same vessels from another AIS vendor. Unfortunately, those tracks are much sparser...

To get back to the same recall you'd have to find some other tracks and label them, preferably spread out over the different gear types...
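
To put a number on "sparse" when comparing candidate tracks, something like the sketch below can help. It assumes timestamps in seconds for a single vessel; the function name and the choice of statistics are illustrative and not taken from the vessel-scoring code.

```python
from statistics import median


def sampling_stats(timestamps):
    """Summarize how densely a track is sampled.

    timestamps: point times for one vessel, in seconds, in any order.
    Returns (median_gap_seconds, points_per_hour), or (None, None) when
    the track has fewer than two points or spans zero time.
    """
    ts = sorted(timestamps)
    if len(ts) < 2 or ts[-1] == ts[0]:
        return None, None
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    duration_hours = (ts[-1] - ts[0]) / 3600.0
    return median(gaps), len(ts) / duration_hours
```

Running this on the same vessel's track from two sources would show how much denser one is than the other before any labeling effort is spent.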

What kind of project are you working on?

cspoh commented 6 years ago

@redhog I am thinking of starting a project to classify vessels by type (i.e. passenger, cargo, fishing, ...) using their behavior.

sakalu commented 4 years ago

Hi @cspoh ! You need to use the branch https://github.com/GlobalFishingWatch/vessel-scoring/tree/anonymous-training-data of this repository to be able to train using the anonymized published training dataset. This is due to differences in paths, and some training data not being published (see above, regarding the data from @KristinaBoerder ).

@redhog I cannot find the branch https://github.com/GlobalFishingWatch/vessel-scoring/tree/anonymous-training-data in this repository and am wondering if it has been removed? Thanks!! Saka

bitsofbits commented 4 years ago

@sakalu, I think the data you want has been moved to this repo: https://github.com/GlobalFishingWatch/training-data

sakalu commented 4 years ago

@sakalu, I think the data you want has been moved to this repo: https://github.com/GlobalFishingWatch/training-data @bitsofbits
Hi Tim,

Thank you for your reply. Yes, I did check the repo "https://github.com/GlobalFishingWatch/training-data", and it mentions that "Vessel tracks and labels are stored separately, and need to be combined for some applications. This is done using a tool included in the repo: ./prepare.sh".

So I executed "prepare.sh", and a new folder "training-data/data/merged" with 6 files (alex_crowd_sourced.npz, false_positives.npz, kristina_longliner.npz, kristina_ps.npz, kristina_trawl.npz, pybossa_project_3.npz) was generated. I thought these were the training data with labels (1 -> Fishing, -1 -> Non-Fishing). However, I found that all of the labelled track data in the merged folder (such as kristina_trawl.npz) are labelled "-1" (none is "1"). That seems abnormal, as they are fishing vessels. Do you have any idea about this? Thanks in advance. Regards, Saka
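
A quick way to check the label distribution in each merged file is to count the distinct values of the label field, as sketched below. The archive key "x" and the function name are assumptions; 'is_fishing' is the field name mentioned earlier in this thread. This does not explain why the labels come out as -1, but it shows whether that really holds for every file and every point.

```python
import numpy as np


def label_counts(path, key="x", field="is_fishing"):
    """Return {label_value: count} for one field of an .npz dataset.

    key is the array name inside the archive; "x" is an assumption, so
    check np.load(path).files if it raises a KeyError.
    """
    with np.load(path) as archive:
        data = archive[key]
    values, counts = np.unique(data[field], return_counts=True)
    return {float(v): int(c) for v, c in zip(values, counts)}
```

Calling it on each file in the merged folder, for instance `label_counts("training-data/data/merged/kristina_trawl.npz")`, gives one dictionary per file that can be compared side by side.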

viniciusmonteiro commented 4 years ago

Hi guys,

Wondering where I could find the following json file:

messages = models['Logistic'].predict_messages(
    messages_from_bq_dump("/home/redhog/Downloads/results-20170601-170811.json"))

Or perhaps any other example of the input format for the method predict_proba would be great too:

is_fishing = models['Logistic'].predict_proba(track_points)

thanks!

aschneid42 commented 4 years ago

Wondering where I can find classified-filtered.measures.npz and slow-transits.measures.npz? They are not in https://github.com/GlobalFishingWatch/training-data

The npz data in https://github.com/GlobalFishingWatch/training-data has a column 'is_fishing', while the vessel_scoring code is looking for a column 'classification'. How do I get the 'classification' column, or do I simply rename 'is_fishing' to 'classification'?

Thanks in advance for any advice.

@bitsofbits @redhog I'm having the same trouble and wondering if things have moved/changed around a bit in the last couple years...

I was able to run the add_measures.py script to get a data file such as: kristina_longliner.measures.npz, but when I try to load them into Model-Descriptions.ipynb, it complains that they do not have the classification field. Should the "is_fishing" field simply be changed to classification, or is this meant to be the vessel or fishing gear type?

Is there a script to add the classification? I guessed "classify.sh", which tries to call "join_alex_classification.py", but I don't have the inputs: "datasets/alex/tracks.msg" or "datasets/alex/classification-hourlyResultsAll.txt", so it won't run. It looks like this would create the "classified-filtered.npz" to which I could add measures, but it would not add the classification field to the other files...

Any help to get your Model-Descriptions notebook to run would be greatly appreciated, Thanks!

bitsofbits commented 4 years ago

I haven't fired up this code for a while, but I will try to fire it up and check this out in the next week or so and see if I can figure out what is going on.

GlobalFishingWatch / vessel-scoring

Training dataset #70