[x] Run crawler to detect 1000 potential registration forms
[x] Algorithm for extraction of registration forms features
[ ] Registration form classification (dependent on Training reg. forms dataset collection)
[ ] Data pre-processing: cleaning, language features embedding
[ ] Modeling
[ ] Using the output of classification in the crawler (this connection is more challenging than it seems)
[ ] Email classification (dependent on collection of Training reg. forms dataset and Mail labeling)
[ ] Features analysis and extraction
[ ] Classification
[x] Email registration confirmation
[x] Finish registration process (e.g., clicking confirmation links, using registration code)
Study
[x] Pilot study
[x] What aspects are interesting?
[ ] Training registration forms dataset collection (depends on Crawler orchestration and Running crawler)
[ ] Can we collect 1000 registration forms?
[x] Processing corresponding emails
[ ] Final study
[ ] In the ideal case, we can find all types of violations automatically. This study then analyzes a sample to estimate the rates of false positives and false negatives
[ ] If the automation is not that successful, we have to use the orchestration. Can we do 10k registrations?
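For the sample check in the final study, the two error rates could be computed roughly as follows. This is a minimal sketch, assuming each sampled registration has an automated verdict and a manual label (True = violation); the names `ErrorRates` and `error_rates` are hypothetical and not part of the existing tooling.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ErrorRates:
    false_positive_rate: float  # share of clean cases flagged as violations
    false_negative_rate: float  # share of real violations that were missed


def error_rates(predicted: Sequence[bool], actual: Sequence[bool]) -> ErrorRates:
    """Compare automated verdicts with manual labels (True = violation)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    false_neg = sum(1 for p, a in zip(predicted, actual) if a and not p)
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)

    # Assumption: report 0.0 when a class is absent from the sample.
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return ErrorRates(fpr, fnr)


# Hypothetical sample of five manually reviewed registrations.
automated = [True, True, False, False, True]
manual = [True, False, False, True, True]
rates = error_rates(automated, manual)
```

The sample size needed for a useful confidence interval is a separate question that this sketch does not address.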
Writing
[ ] Analyze the following research questions:
[x] Are email addresses shared with third parties?
[ ] Where do the spammers get the email addresses?
[ ] What proportion of services send unsolicited mail? Are they smaller or larger companies?
[ ] Which services force users to accept newsletters? Are they smaller or larger companies?
[ ] Are the registration forms themselves compliant (pre-accepted T&C/PP)?
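For the last question, one possible check for pre-accepted consent boxes is to look for checkboxes that are already ticked in the form's static HTML. The sketch below uses only Python's standard `html.parser`; the class and function names are hypothetical. It is an assumption that the form HTML has already been saved by the crawler, and checkboxes ticked by JavaScript after page load would not be detected this way.

```python
from html.parser import HTMLParser
from typing import List


class PrecheckedCheckboxFinder(HTMLParser):
    """Collects names of <input type="checkbox"> elements that carry `checked`."""

    def __init__(self) -> None:
        super().__init__()
        self.prechecked: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)  # valueless attributes such as `checked` map to None
        is_checkbox = (attr.get("type") or "").lower() == "checkbox"
        if is_checkbox and "checked" in attr:
            # Fall back to the id, then to an empty string, if no name is set.
            self.prechecked.append(attr.get("name") or attr.get("id") or "")


def find_prechecked(html: str) -> List[str]:
    finder = PrecheckedCheckboxFinder()
    finder.feed(html)
    finder.close()
    return finder.prechecked


# Hypothetical form fragment for illustration.
sample_form = (
    '<form>'
    '<input type="checkbox" name="newsletter" checked>'
    '<input type="checkbox" name="terms">'
    '<input type="text" name="email">'
    '</form>'
)
found = find_prechecked(sample_form)
```

Whether a pre-ticked box actually concerns T&C/PP or a newsletter would still need to be decided from the label text, which this sketch does not inspect.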