Names of some subcorpora are abbreviated

acdh-oeaw / dylen-backend

Names of some subcorpora are abbreviated #9

Closed sbyim closed 3 years ago

sbyim commented 3 years ago

It's hard to guess what the actual media source is.

anyaat commented 3 years ago

These are the names used in the AMC corpus data. As we agreed at the tool meeting in October, we are going to have just a few sources, and the 9 that have been chosen for now are not abbreviated.

sbyim commented 3 years ago

What are the 9 sources? I've just imported the data as you generated and shared it; no filtering is done on the tool side yet.

sbyim commented 3 years ago

I can't find anything regarding the sources in the meeting protocol. Could you post it here? If that's the case, is it possible to generate networks only for the corpora/subcorpora of interest?

anyaat commented 3 years ago

Initially I was asked to use all sources for the tool. I proposed dropping most of them (they are small and of little relevance) so that we could afford more target words. But we then decided to keep them for the trial so that Tanja and Andreas could choose the sources based on the analysis. The November trial didn't happen, and I was told it would not be possible to conduct it using the tool in the near future. Therefore, we are trying to estimate the parameters without the tool. Once the basic parameters are finalised, I will compute the data for 2000 words, which could be used for the tool development.

anyaat commented 3 years ago

Here you can find some updates to the project plan

sbyim commented 3 years ago

Thx for the updates. Clearly I wasn't informed about these decisions and imported all of the November data into the tool, but that work had to be done anyway. I will filter out the abbreviated data on the tool side for now.

sbyim commented 3 years ago

Ego networks will be generated only for the selected subcorpora, whose names are not abbreviated.