Closed: ansjin closed this issue 7 years ago
@ansjin Are all files in the provided WET like that? What percentage is different or better? Or is it just 100% rubbish? The presence of rubbish is not surprising, but it should not prevent us from extracting meaningful relations from acceptable sources.
The point is, even if 95% of the input texts are rubbish, this shouldn't matter: the remaining 5% should deliver enough signal for meaningful results, which can later be separated from the rubbish relations (e.g. using simple counting).
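To make the "simple counting" idea concrete, here is a minimal sketch. It assumes relations come out of the extractor as (subject, predicate, object) tuples; the tuple shape, the `frequent_relations` name, the `min_count` threshold, and the sample data are all made up for illustration and are not part of our pipeline.

```python
from collections import Counter


def frequent_relations(relations, min_count=2):
    """Keep only relations extracted at least `min_count` times.

    Assumption: a relation that shows up independently in several texts
    is more likely real than a one-off produced by page boilerplate.
    """
    counts = Counter(relations)
    return {rel: n for rel, n in counts.items() if n >= min_count}


# Hypothetical extractor output, not real data from the WET file.
extracted = [
    ("Mozart", "born_in", "Salzburg"),
    ("Mozart", "born_in", "Salzburg"),
    ("Sort by", "YearTitleType", "Order"),
    ("Mozart", "born_in", "Salzburg"),
]

kept = frequent_relations(extracted, min_count=2)
```

The threshold would have to be tuned on real output. Boilerplate that repeats on every page (navigation menus, for instance) could still pass a pure count, so this would probably work best combined with a page-level filter.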
@vviro Yeah, the file is like that. It looks like it comes from blogs with lots of comments and other unnecessary content that is largely irrelevant to the algorithms. Please see the attached WET file input.txt
I will post the complete output of our algorithm here in a while so we can see whether it contains any meaningful data. The algorithm cannot run on our PC, so I am hosting and running it on Azure.
@ansjin So the claim is that this WET file contains only relevant texts? If so, maybe the filter needs to be tweaked to exclude different classes of obviously irrelevant pages (if they are relatively easy to identify).
@vviro Yeah, we need to build or find a filter that can exclude such pages. I think it would be even better if such websites were not scraped and included in the WET files in the first place, so the filter could be applied at scraping time.
765 scraped wiki pages can be found here: https://github.com/MusicConnectionMachine/UnstructuredData/issues/65 I hope it helps ;)
We (group 2) are starting to test the complete pipeline. As soon as the DB is up and running we can provide more data.
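A rough sketch of what such a filter could look like, assuming each page's text is available as a plain string. The heuristic (share of sentence-like lines plus a keyword check), the function names, the keyword list, and the thresholds are all assumptions for illustration, not something either group has implemented.

```python
def prose_ratio(text, min_words=8):
    """Fraction of non-empty lines that look like full sentences."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    prose = [
        ln for ln in lines
        if len(ln.split()) >= min_words and ln[-1] in ".!?"
    ]
    return len(prose) / len(lines)


def looks_relevant(text, keywords=("composer", "symphony", "opera"), min_ratio=0.5):
    """Heuristic page filter: mostly prose and mentions a music keyword.

    The keyword list and threshold are hypothetical and would need tuning.
    """
    lowered = text.lower()
    return prose_ratio(text) >= min_ratio and any(k in lowered for k in keywords)


# Hypothetical page texts for illustration only.
article = (
    "Johann Sebastian Bach was a German composer of the Baroque period.\n"
    "He was born in Eisenach in 1685 and died in Leipzig in 1750."
)
menu_junk = "Sort by\nYear Title Type\nOrder Asc Desc\nLeave a comment"
```

With these inputs, `looks_relevant(article)` passes and `looks_relevant(menu_junk)` does not. A check like this could run either while scraping or as a pre-processing step before relation extraction.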
Check the CSV file here https://github.com/MusicConnectionMachine/api/issues/22#issuecomment-292573343
It was given to us by group 2, so this issue can be closed.
We parsed the sample WET file provided by you here https://github.com/MusicConnectionMachine/UnstructuredData/issues/58.
The file looks something like this:
This WET file contains a lot of user comments and unnecessary data rather than real data about any composer. The data appears to come mostly from blogs.
How can we find relationships or any timeline events from this file?
If you have any newly extracted data, can you share it with us, @MusicConnectionMachine/group-2? Preferably the scraped Wikipedia page of a composer!
@vviro Our algorithms cannot detect anything from this data. The output of one algorithm on this text is:
{ start: '1', end: '2014', event: 'Sort by YearTitleType Order AscDesc Displaying 1 - 1 of 1Export 1 results:BibTexTaggedXML 2014 H.' }, { start: '2014', event: '709-736, 2014.' } ] example-output.txt