Note: some of the feedback to individual groups applies to everybody.
Note 2: I warmly encourage you to make all your code and useful results open source or otherwise easily available for others. I'm already seeing cool youtube videos, handy libraries, or otherwise interesting stories or visualizations. Spread the word of your great work :)
Terrorism Group
Figure in Sec. 1.2: if the color gradient carries no real meaning, do not use it. It leads the reader to believe that it represents another variable.
Figure in Sec. 1.2: "However, we can see that from 2014 to 2015 this number decreased." That is not the first thing that stands out from the figure, so the sentence needs more supporting information. Furthermore, it's very difficult or impossible to extrapolate a trend from only those 2 years, since in all previous years the attacks decreased. BTW, do you have ALL attacks of 2015 in your dataset? Do you have attacks of 2016?
Make sure your figures are perfectly visible and that the axes are well explained. For example, Fig. Sec. 1.2, y axis: count? Spell out something like: "terrorist attacks, count".
Adding to the last point, Figures/Tables should be self-contained and stand by themselves. Always put a caption explaining what we see or should see and cite the figure/table in the main text.
Love the youtube videos. They only go a bit too fast and so are difficult to follow. Perhaps more gradual transitions would ease the interpretation. Also, the legend of used colors is too tiny to notice and by themselves the meaning of the colors is not obvious. Plus, what does the size mean then? That is, explain all your visual cues.
Good finding on the apparent error on labels
Figure "Type of Attacks evolving since 1970", again the legend is too small to read
...didn't go in detail until...:
Figure "Number of Kills vs Attack Type": why not just write the attack type labels on the x axis? It is very difficult to read, even more so considering that the odd but unlabeled numbers are actually valid values.
Last Figure and all: always provide an explanation --> what do we see here?
Figure in Sec 2.2 --> I know what it means, but again, write out all axes meanings -- x axis?
AND Always sort frequency histograms. It would make the reading that much easier. Example: from the figure alone, what are the top 3 features with most missing values? Very difficult to answer if it's not sorted.
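To illustrate the sorting point, here is a minimal sketch. The feature names and counts are made up for illustration and are not taken from your dataset. Once the counts are sorted in descending order, the "top 3 features with most missing values" question can be answered at a glance.

```python
# Hypothetical missing-value counts per feature (made-up numbers, for illustration only).
missing_counts = {
    "city": 120,
    "weapon_type": 4500,
    "motive": 98000,
    "target": 310,
    "ransom": 76000,
}

# Sort features by count, largest first, before drawing the bar chart.
sorted_counts = sorted(missing_counts.items(), key=lambda kv: kv[1], reverse=True)

# The top 3 can now be read directly from the start of the list.
top3 = [name for name, _ in sorted_counts[:3]]
print(top3)

# Plain-text version of the sorted histogram.
for name, count in sorted_counts:
    print(f"{name:<12} {count:>7}")
```

Feeding the sorted list to whatever plotting library you use should give bars in the same order.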
World Food Facts
Why not put a link to the website of the dataset? http://world.openfoodfacts.org -- Please do so -- OK, now I see it at the very end. Just link in the beginning too.
Why not briefly present the website to your colleagues in the presentation? It would make it that much easier to visualize what data you're handling, and much more exciting!
Hypothesize: why is most of the data coming from France, why this bias? Hint... :
The Open Food Facts web site is published by the non-profit organization Open Food Facts (French "Loi 1901" Association).
Address: 21 rue des Iles, 94100 Saint-Maur des Fossés, France
Your definition of important attributes requires more explanation. Is it only based on the fact that the "unimportant" features are missing? Might there be features with many missing values that are nonetheless "important"?
Please separate your paragraphs / sections better. It was confusing that "We also managed to remove words stopword(English), punctuation in "allergens" attribute." immediately followed the previous figure.
Why is the word "English" appearing in the allergens attribute?
Would be nice to read 2-3 examples of the allergen attributes
And why focus on the attribute allergen in the first place? That is, why no others?
Are those libraries from Python?
I personally find it difficult to read your figures expressing the relationship between attributes
Expressing energy in kJ is actually perfectly correct. Nonetheless, people are mostly accustomed to dietary calories (kcal), so showing both kJ and kcal helps the audience
Put some history and color into the Fig. in Sec 1.3 -- give some examples of additives and allergens. Besides the obvious stats you see: can you explain what's going on or make a story out of this? Why do those 6 countries seem to be outliers?
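On the kJ versus dietary calories point above, a minimal sketch of a label that shows both units. It assumes the thermochemical definition 1 kcal = 4.184 kJ. The function names are hypothetical and not part of any library you are using.

```python
KJ_PER_KCAL = 4.184  # thermochemical calorie: 1 kcal = 4.184 kJ


def kj_to_kcal(kj):
    """Convert an energy value from kilojoules to kilocalories."""
    return kj / KJ_PER_KCAL


def energy_label(kj):
    """Build a label showing both units, e.g. for an axis tick or a table cell."""
    return f"{kj:.0f} kJ ({kj_to_kcal(kj):.0f} kcal)"


print(energy_label(2092))
```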
Google N Grams
Since you focus on explaining the python scripts, would be nice to have a link to them.
Is the word "enter" an artifact of introductory sentences like "Character X entered the scene..." ?
Do you want to have the words "would" or "shall" in your top 10?
To make sure, did you "lemmatize" the words (and not only "stem" them)? In that case, "would" should not have appeared, but rather "will".
What are these seemingly weird peaks at the very end of every frequency graph? Always interpret and explain your figures/tables.
To me, nonetheless, this indicates that the gram counts should be normalized by the total count of each year for this type of analysis. I assume (correct me if I'm wrong) that older years had a lower count of grams just because fewer written resources are available. In that respect, it would be interesting to see the growth of total grams over time, which I guess should be exponential.
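A minimal sketch of the normalization I have in mind, with made-up numbers. Both dictionaries are hypothetical, and I am assuming you can obtain a total gram count per year from your data. Note how the raw count grows while the relative frequency actually falls.

```python
# Hypothetical raw counts of one word per year, and total grams per year
# (made-up numbers, for illustration only).
word_counts = {1800: 50, 1900: 400, 2000: 3000}
total_counts = {1800: 100_000, 1900: 1_000_000, 2000: 10_000_000}


def relative_frequency(word_counts, total_counts):
    """Return occurrences per million grams for each year."""
    return {
        year: count / total_counts[year] * 1_000_000
        for year, count in word_counts.items()
    }


per_million = relative_frequency(word_counts, total_counts)
for year in sorted(per_million):
    print(year, round(per_million[year], 1))
```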
UK Crimes
"The number of points of interest and stops by the police in the near of a crime differs significantly between the crime types as well as between the outcome types. E.g. thefts from the person mostly happen at places with a lot of points of interest." --> correct the grammar a bit
" we are in contact with the police of UK." -- hahaha this definitely requires more explanation -- you contact the police in the UK and this is all you have to say?? Did they answer? Did they provide useful information? Were they nice? Are they allowed to provide any more data on this? Come on guys!!
"we contacted the UK Police and asked..."
"in cooperation with the UK police..."
So the "zeroed out" coordinates: how many of those instances do we have?
Can you show on a map the possible places where coordinates are zeroed out? You would just have to pin the master 750k points, draw a radius of 20km around them, then everything else should be the "dark zone" ? Are my assumptions correct?
Liked very much that you included source references for your statements.
Sorry, what is "ASB" ? Don't assume the reader knows: explain.
CPS ?
"success rates between 19% and 97% depending on where in the country the crime happened" -- to my knowledge, there are only 4 countries. Does this mean that we have a success rate estimation for each individual country?
What is "number poi"? what is "person searches?" besides the obvious ?
Explain what LSOA is, -- POI, ... btw I know the meaning of most of these. It's only that you cannot assume that for a new reader. For every new text (e.g. every new wiki page), spell out the abbreviation the first time you use it, then use the abbreviation.
Liked that you wrote a work log.
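On the "dark zone" map idea above, a rough sketch of the test I am assuming: a location is in the dark zone if it lies more than 20 km from every master point. The coordinates are hypothetical examples, the 20 km radius is my assumption from the comment above, and the haversine formula treats the Earth as a sphere, so distances are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_dark_zone(lat, lon, master_points, radius_km=20.0):
    """True if the location is farther than radius_km from every master point."""
    return all(
        haversine_km(lat, lon, mlat, mlon) > radius_km
        for mlat, mlon in master_points
    )


# Hypothetical master list with a single point in central London.
master_points = [(51.5074, -0.1278)]

print(in_dark_zone(51.55, -0.10, master_points))       # a few km away
print(in_dark_zone(53.4808, -2.2426, master_points))   # Manchester, far away
```

With 750k master points a brute-force loop like this would be slow. A spatial index would be the natural next step, but the logic stays the same.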
Climate Change
Always define your terms or abbreviations (PRCP, Tmax, PCA, ...), especially if they are written in a summary
You need to give many more details on those other 7 new datasets: where do they come from? why those? ...
Any news on releasing your Python parsing library? :)
Figures: temperatures measured in...? Put that in the axes labels. Furthermore, to me at least, it's very difficult to interpret and relate something like 160 to Celsius -- can you come up with another format for the tenths of Celsius degrees?
You could easily merge the Tmax & Tmin graphs into one to be able to compare them. Separated, and with different scales for the y axes, they are impossible to compare with one another.
I liked that you gave explanations to your figures.
"It is observed that there is a trend towards the shrinking of the seasons, as spring, summer and autumn temperatures approach each other over the years." -- that's not obvious to me -- Can you defend your statement with real numbers or is this a gut feeling?
Is there really such a drastic drop of average temperatures in winter compared to the others even to autumn ?
AFAIK, seasons in Delhi are not as in central Europe. For example, what about the rainy season and what months does it span over ?
Put exact measures, labels, and captions in all your figures/captions
Figures of CO2 emissions -- again if you show them together and using the same scales they become that much easier to compare
Is the apparent stagnation of CO2 emissions in the last years real, or an artifact of the dataset, maybe just because the values for the last years are missing? Otherwise, this is a very intriguing and interesting finding. That's why it requires more exploration.
Great that you added references. Why didn't you reference them in the text?
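On the tenths-of-Celsius point above, the simplest fix is to convert before plotting and label the axis in degrees Celsius. A minimal sketch, assuming the raw values are integers in tenths of a degree as your figures suggest. The sample values are made up.

```python
def tenths_to_celsius(value):
    """Convert a temperature stored in tenths of a degree Celsius to degrees Celsius."""
    return value / 10.0


# Hypothetical raw Tmax readings in tenths of a degree Celsius.
raw_tmax = [160, 215, -35]

tmax_celsius = [tenths_to_celsius(v) for v in raw_tmax]
print(tmax_celsius)  # [16.0, 21.5, -3.5]
```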
London RE
Great that you made available your dataset as a link
"higher than the general average" -- what is the general average? And, you mean average of whole London area or whole England, or whole UK ?
Nice map visualizations. By applying some color grading, it would be easier to spot the concentration of RE offerings
A pareto chart of the agents/offices would be useful
Let's try to do more correlation between attributes by next week
From the zoopla ad lol, I got the idea that the API can also return crime data ?
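For the Pareto chart of agents/offices mentioned above, the underlying table is just the counts sorted in descending order plus a cumulative percentage. A minimal sketch with made-up agent names and listing counts, none of which come from your dataset:

```python
from collections import Counter

# Hypothetical listing records: one agent name per listing (made-up data).
listings = ["Agent A"] * 50 + ["Agent B"] * 30 + ["Agent C"] * 15 + ["Agent D"] * 5

# most_common() returns (agent, count) pairs sorted by count, descending.
counts = Counter(listings).most_common()
total = sum(n for _, n in counts)

# Build the Pareto table: agent, count, cumulative share in percent.
pareto = []
running = 0
for agent, n in counts:
    running += n
    pareto.append((agent, n, 100.0 * running / total))

for agent, n, cum_pct in pareto:
    print(f"{agent:<8} {n:>4} {cum_pct:6.1f}%")
```

Plotting the counts as bars and the cumulative percentage as a line on a second axis gives the usual Pareto chart.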