Airwindopedia sync - Githubissues

alucast commented 2 years ago

Chris has released a page with a regularly updated .txt file containing his plugin descriptions, so I thought it would be awesome to sync the cheatsheet with it, if that's even possible.

https://www.airwindows.com/wp-content/uploads/Airwindopedia.txt

video: https://youtu.be/WTaFP7Lhj7E

ajboni commented 2 years ago

Hey @alucast thanks! That's a great idea!! I'll try to include it in the description box. It might be hard because the titles don't exactly match the names in the DB, but hopefully the first word matches.

ajboni commented 2 years ago

I think I will rewrite everything to use Airwindopedia as the main source for the database. It is golden information! I can use Chris's categorization for each plugin as well. My only concern is that if Chris decides to change the formatting in the future, it will break... Thanks for bringing this up.

robbiehinch commented 2 years ago

Hi, I'm interested in having a go at this. I was thinking of adding a new controller / endpoint that fetches Airwindopedia.txt, parses the contents and writes a new database.js that can be committed to the repo. Did you already make a start on this? I like the existing database.js because it is more fine-grained for filtering, so I was thinking of leaving as much of that in place as possible and translating the parts that are missing.

hockinsk commented 2 years ago

I've had a play scraping the Airwindows website with importxml, using wp-sitemap-posts-post-1.xml as the URL reference for each plugin page, since Chris adds tags to the plugin pages which are sort of useful. My plan was to use some regex to pull the relevant plugin section from Airwindopedia.txt, rather than the rather cumbersome paragraphs from the plugin page itself, to keep it short in a spreadsheet. Unfortunately, there are some inconsistencies in how Chris writes each plugin page. importxml is also slow as hell: it seems to limit how many import requests can be made, so it only builds a few records each time you open it. Once built, though, that would be enough to keep it updated. There's a little Google Sheets example here that should work and demonstrate the above. I'll have a go at pulling the .txt and somehow linking it when I get a chance. I think automatically scraping the website and the .txt file is the way to go imo. https://docs.google.com/spreadsheets/d/1ZH5koBSq_ez0xh6fjyOIXjvbUgqRadQIu5-27oA_kmA/edit?usp=sharing
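As a rough illustration of the sitemap step described above, here is a minimal sketch that pulls page URLs out of a sitemap XML string. It assumes the standard sitemaps.org layout (`<url><loc>…</loc></url>`); the function name and the sample URLs are hypothetical, and the real file would have to be fetched from the site first.

```javascript
// Sketch: collect every <loc> URL from a sitemap XML string.
// Assumes the standard sitemaps.org structure; not tied to any specific site.
function extractSitemapUrls(xml) {
  const urls = [];
  const locRegex = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  for (const match of xml.matchAll(locRegex)) {
    // Sitemaps XML-escape ampersands, so undo that to get usable URLs.
    urls.push(match[1].replace(/&amp;/g, "&"));
  }
  return urls;
}

// Hypothetical sample data, standing in for the downloaded sitemap.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.airwindows.com/example-plugin/</loc></url>
  <url><loc>https://www.airwindows.com/another-plugin-vst/</loc></url>
</urlset>`;

console.log(extractSitemapUrls(sampleSitemap));
```

A regex is enough here because only the `<loc>` values are needed; a proper XML parser would be the safer choice if more of the sitemap were used.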

ajboni commented 2 years ago

@hockinsk In my opinion, scraping the rest of the website might be a little redundant (we have the WP website for that info!), but if you want to have a go at it, maybe it's easier to fetch each page in Node or with Python's Beautiful Soup and use that to build a database that we can then use for the UI (and you'll probably be able to use it in gdocs too?)

@robbiehinch please do! I've started doing something but ran out of time, and unfortunately I can't work on it until next month. If you want to carry on, this is what I have; feel free to use it as a starting point or inspiration. The big caveat is that this could break if any big modification is made to the txt.

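const airWindopediaURL =
  "https://www.airwindows.com/wp-content/uploads/Airwindopedia.txt";

// These separators depend on the current layout of the txt and could change;
// we might need a better approach.
const introTextSeparator = "\r\n\r\n\r\n\r\n\r\n\r\n";
const entriesSeparator = "############";
const categoriesDB = {};

init().catch(console.error);

async function init() {
  const response = await fetch(airWindopediaURL);
  if (!response.ok) {
    throw new Error(`Failed to fetch Airwindopedia: HTTP ${response.status}`);
  }
  const content = await response.text();

  // Everything before the first separator is the intro, which holds the category lists.
  const entries = content.split(entriesSeparator);
  const intro = entries.shift();

  const categoriesText = intro.split(introTextSeparator)[1];
  if (!categoriesText) {
    throw new Error("Intro layout changed: category block not found");
  }
  // Each category line looks like "Name: PluginA, PluginB, ...".
  const categoriesRegex = /^(\w[\w\s\-]+):(.+)$/gm;
  const categoriesMatches = categoriesText.matchAll(categoriesRegex);

  for (const match of categoriesMatches) {
    const category = match[1].trim();
    // Named pluginNames to avoid shadowing the outer `entries`.
    const pluginNames = match[2].split(",").map((name) => name.trim());
    categoriesDB[category] = pluginNames;
  }

  // TODO: Build the database
  console.log(categoriesDB);
  // we could use the first word of the entry title to map it with the category DB
}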

Output:

{
  Ambience: [
    'Galactic',        'Verbity',
    'Chamber',         'Infinity2',
    'TapeDelay2',      'PitchDelay',
    'GlitchShifter',   'NonlinearSpace',
    'BrightAmbience3', 'Infinity',
    'MatrixVerb',      'Melt',
    'PocketVerbs',     'PurestEcho',
    'TapeDelay',       'StarChild',
    'Chorus',          'ChorusEnsemble',
    'Reverb',          'BrightAmbience2',
    'BrightAmbience',  'MV',
    'ADT'
  ],
  'Amp Sims': [
    'GrindAmp', 'FireAmp',
    'LeadAmp',  'LilAmp',
    'MidAmp',   'BigAmp',
    'XRegion',  'Cabs',
    'Golem'
  ],
  Clipping: [
    'ClipOnly2',
    'OneCornerClip',
    'ADClip7',
    'Mackity',
    'Edge',
    'Crystal',
    'AQuickVoiceClip',
    'ClipOnly',
    'Slew3',
    'Slew2',
    'Slew'
  ],
  Consoles: [
    'Console8',
    'PurestConsole2',
    'Console7',
    'Console7Cascade',
    'Console7Crunch',
    'PurestConsole',
    'PDConsole',
    'Console6',
    'Atmosphere',
    'Console5',
    'C5RawConsole',
    'uLaw'
  ],
  Dithers: [
    'Monitoring2',      'Monitoring',
    'Dark',             'PaulWide',
    'PaulDither',       'TPDFWide',
    'TPDFDither',       'NotJustAnotherDither',
    'Beam',             'TapeDither',
    'SpatializeDither', 'VinylDither',
    'DoublePaul',       'Ditherbox',
    'BuildATPDF',       'NodeDither',
    'StudioTan',        'DitherMeTimbers',
    'RawTimbers',       'NaturalizeDither',
    'HighGlossDither',  'DitherFloat'
  ],
  Dynamics: [
    'Pressure5',   'Pop',
    'Logical',     'VariMu',
    'ButterComp2', 'curve',
    'Recurve',     'Pyewacket',
    'BlockParty',  'SoftGate',
    'Thunder',     'Compresaturator',
    'Smooth',      'DrumSlam',
    'BrassRider',  'Point',
    'Gatelope',    'PodcastDeluxe',
    'Podcast',     'TremoSquare',
    'Tremolo',     'Swell',
    'Pressure4',   'Surge'
  ],
  Filter: [
    'Z2 and Z Filters',
    'Capacitor2',
    'Isolator2',
    'DeBess',
    'Holt',
    'Y Filters',
    'Energy2',
    'AverMatrix',
    'Average',
    'MackEQ',
    'Hermepass',
    'Baxandall',
    'Hull',
    'X Filters',
    'Aura',
    'EQ',
    'Hombre',
    'Air2',
    'Capacitor',
    'BassKit',
    'Isolator',
    'TapeFat',
    'Energy',
    'various Biquad variations',
    'ResEQ',
    'DubCenter',
    'DubSub',
    'Preponderant',
    'Lowpass2',
    'Highpass2',
    'DeEss',
    'Floor',
    'FathomFive',
    'Distance2',
    'Distance',
    'Air'
  ],
  'Lo-Fi': [
    'DeRez2',
    'BitGlitter',
    'CrunchyGrooveWear',
    'DeRez',
    'GrooveWear',
    'Cojones',
    'Fracture',
    'Vibrato',
    'Bite',
    'Deckwrecka',
    'DustBunny',
    'Nikola'
  ],
  Noise: [
    'Noise',
    'Texturize',
    'Voice Of The Starship',
    'DarkNoise',
    'ElectroHat',
    'Facet',
    'Fracture',
    'PowerSag2',
    'PowerSag',
    'Silhouette',
    'DustBunny',
    'Gringer'
  ],
  Saturation: [
    'Mackity',     'Tube2',
    'Tube',        'Density2',
    'Spiral2',     'BussColors4',
    'PurestDrive', 'Density',
    'Drive',       'Hard Vacuum',
    'Distortion',  'Focus',
    'Edge',        'Dirt',
    'Mojo',        'Dyno',
    'Loud',        'SingleEndedTriode',
    'Spiral',      'HighImpact',
    'BassDrive',   'Cojones',
    'Fracture',    'NC-17',
    'Unbox',       'Desk4',
    'Facet',       'APIcolypse',
    'Calibre',     'Cider',
    'Crystal',     'Precious',
    'Luxor'
  ],
  Stereo: [
    'StereoFX',
    'Srsly',
    'Srsly2',
    'Wider',
    'StereoChorus',
    'AutoPan',
    'BrightAmbience3',
    'StereoEnsemble',
    'StereoDoubler',
    'TripleSpread',
    'BrightAmbience2',
    'LRFlipTimer'
  ],
  Subtlety: [
    'BitShiftGain',      'PurestGain',
    'PurestFade',        'EveryTrim',
    'HermeTrim',         'Acceleration2',
    'Hype',              'Shape',
    'PurestWarm2',       'PurestWarm',
    'GuitarConditioner', 'Coils2',
    'Interstage',        'SurgeTide',
    'PhaseNudge',        'Hypersonic',
    'HypersonX',         'Ultrasonic',
    'UltrasonX',         'Remap',
    'SingleEndedTriode', 'Coils',
    'Srsly',             'Texturize',
    'Smooth',            'Acceleration',
    'Desk',              'TransDesk',
    'TubeDesk'
  ],
  Tape: [
    'ToTape6',
    'TapeDelay2',
    'FromTape',
    'Tape',
    'IronOxideClassic2',
    'IronOxide5',
    'ToTape5',
    'TapeDelay',
    'IronOxideClassic'
  ]
}
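
One possible way to tackle the remaining TODO, sketched under assumptions: each chunk after a `############` separator is taken to start with a title line whose first word is the plugin name, and the rest is treated as the description. The sample text is made up, and `invertCategories` and `buildDatabase` are hypothetical helper names, not part of the existing code.

```javascript
const ENTRY_SEPARATOR = "############";

// Turn { category: [names] } into a Map of name -> [categories].
function invertCategories(categoriesDB) {
  const byPlugin = new Map();
  for (const [category, names] of Object.entries(categoriesDB)) {
    for (const name of names) {
      if (!byPlugin.has(name)) byPlugin.set(name, []);
      byPlugin.get(name).push(category);
    }
  }
  return byPlugin;
}

// Assumption: first line of each entry is the title, first word is the plugin name.
function buildDatabase(content, categoriesDB) {
  const categoriesByPlugin = invertCategories(categoriesDB);
  const [, ...rawEntries] = content.split(ENTRY_SEPARATOR); // drop the intro
  return rawEntries
    .map((raw) => raw.trim())
    .filter((raw) => raw.length > 0)
    .map((raw) => {
      const [title, ...bodyLines] = raw.split(/\r?\n/);
      const name = title.trim().split(/\s+/)[0];
      return {
        name,
        title: title.trim(),
        description: bodyLines.join("\n").trim(),
        categories: categoriesByPlugin.get(name) || [],
      };
    });
}

// Made-up sample text in the assumed layout.
const sampleContent = [
  "Intro text and category lists",
  "############ Galactic is a made-up title line.",
  "",
  "Made-up description.",
  "############ Mackity is another made-up title.",
  "",
  "Another made-up description.",
].join("\r\n");
const sampleCategories = {
  Ambience: ["Galactic"],
  Clipping: ["Mackity"],
  Saturation: ["Mackity"],
};

console.log(buildDatabase(sampleContent, sampleCategories));
```

First-word matching will miss multi-word names from the category lists, such as 'Hard Vacuum' or 'Voice Of The Starship', so those would still need separate handling.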

hockinsk commented 2 years ago

The problem with using the .txt file and not the website sitemap is that the text file is missing a lot of plugins. I've already found several that are on the website but not listed in it, and that's only among those beginning with C. It also misses many of the .dmg downloads for macOS. I think the only way to capture it all is to use the sitemap and then merge in the .txt. I've got something I use myself in a spreadsheet, but there's too much randomness in how Chris does things, both on the website and in the .txt, to code something without taking up several days of your life. Anyway, see how it goes. Why Chris uses a 1980s flat text file is beyond me, it's next to useless in terms of usability haha!

robbiehinch commented 2 years ago

I don't know WordPress and wasn't aware of the existence of the sitemap, which makes scraping the WP site much more feasible. @hockinsk I can use your spreadsheet to help me make what should be a much more reliable scraper in JavaScript. Personally I'm of the opinion that we should scrape both for completeness, take the description from Airwindopedia.txt as the priority because it seems more useful/concise, and then add as many tags as can reasonably be gathered from either source. @ajboni I can have a go this week. Btw, thank you for creating this site in the first place. I found the Airwindows suite almost impenetrable until I stumbled on this, and I will be very happy to help enhance it.
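
The merge policy described above (scrape both, prefer the .txt description, collect tags from either source) could look roughly like the sketch below. The record shapes (`name`, `description`, `tags`, `categories`, `url`) and the sample data are assumptions for illustration, not the format used by the cheatsheet.

```javascript
// Sketch: merge records from the .txt and from the website, keyed by plugin name.
// Assumed shapes: txt entries have { name, description, categories },
// site entries have { name, description, tags, url }.
function mergeSources(txtEntries, siteEntries) {
  const merged = new Map();

  // Start from the website so plugins missing from the .txt are still kept.
  for (const site of siteEntries) {
    merged.set(site.name, {
      name: site.name,
      description: site.description || "",
      tags: [...(site.tags || [])],
      url: site.url || null,
    });
  }

  // Layer the .txt on top: its description wins, its categories become extra tags.
  for (const txt of txtEntries) {
    const existing =
      merged.get(txt.name) || { name: txt.name, description: "", tags: [], url: null };
    merged.set(txt.name, {
      ...existing,
      description: txt.description || existing.description,
      tags: [...new Set([...existing.tags, ...(txt.categories || [])])],
    });
  }

  return [...merged.values()];
}

// Hypothetical sample records.
const txtSample = [
  { name: "PluginA", description: "Short txt description.", categories: ["Tape"] },
];
const siteSample = [
  { name: "PluginA", description: "Long page text.", tags: ["tape", "Tape"], url: "https://example.test/plugina/" },
  { name: "PluginB", description: "Only on the site.", tags: ["reverb"], url: "https://example.test/pluginb/" },
];

console.log(mergeSources(txtSample, siteSample));
```

Matching is by exact name here; names that differ between the two sources would need to be normalized or mapped before merging.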

hockinsk commented 2 years ago

I've tidied up my own version and shared it. It's not perfect and is only using the .txt file I hacked into a .csv, but one of the sheets has the sitemap links: https://docs.google.com/spreadsheets/d/1r6fFhK_snwJbsy1vteyBN869X5LqX-sI30VA6sgNJo0/edit?usp=sharing

robbiehinch commented 2 years ago

I made some progress on this:

https://github.com/ajboni/airwindows-cheatsheet/pull/30/files

From the draft PR:

I managed to scrape Airwindopedia and the WordPress site, and build the output into database.generated.js. It superficially looks to be in the right format, but it isn't working if I copy it over database.js and run the website. I don't know the Svelte framework at all or how to debug it :/

hockinsk commented 2 years ago

Cool, good work. As I can see in your code comments, there will be a need for a certain amount of manual override/lookup correction, as unfortunately Chris uses some categories such as Z2 and Z Filters which are just a generic name for lots of plugins with that prefix. There are other anomalies too: for a period of time he started appending '-vst' to the WordPress page URLs for new versions, which makes automatic URL building tricky without a correction lookup. I think I found around 30 issues of the kinds above and manually edited them to fit, so not too bad. There's also a bit of code in the .txt which could potentially cause problems when saving to a DB, I'd think, if not handled or accommodated in the encoding etc. I think I have 287 records, so looks good Robbie! https://docs.google.com/spreadsheets/d/1ogoOjH6euBl-OaePJjDvioZkUp1lgOVmIKchX1yBIek/edit#gid=1454804806
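
A correction lookup of the kind mentioned above might be sketched as follows. Every table entry and the default URL pattern are hypothetical placeholders; the real overrides would come from checking each page by hand.

```javascript
// Hypothetical overrides: plugin name -> page slug that differs from the default.
const SLUG_OVERRIDES = new Map([
  ["ExamplePlugin2", "exampleplugin2-vst"],
]);

// Hypothetical expansion of a generic list entry into concrete plugin names.
const GROUP_EXPANSIONS = new Map([
  ["Example Filters", ["ExampleLowpass", "ExampleHighpass"]],
]);

// Assumption: by default the page slug is the lower-cased plugin name.
function pageUrlFor(name, baseUrl = "https://www.airwindows.com/") {
  const slug = SLUG_OVERRIDES.get(name) || name.toLowerCase();
  return `${baseUrl}${slug}/`;
}

// Replace generic group entries with the plugins they stand for.
function expandNames(names) {
  return names.flatMap((name) => GROUP_EXPANSIONS.get(name) || [name]);
}

console.log(pageUrlFor("ExamplePlugin2"));
console.log(expandNames(["Example Filters", "Aura"]));
```

Keeping the exceptions in plain lookup tables means the roughly 30 manual fixes live in one place and can be reviewed or extended without touching the scraper logic.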

ajboni commented 2 years ago

Great job both @hockinsk and @robbiehinch !! I'm getting the impression that even though Airwindopedia exists, it may require some heavy post-processing / manual work to get it into the right format. And we'd have to stay alert for changes... Maybe it's better to use the gdoc spreadsheet as the list source and combine it with the text scraped from the website, instead of using the txt directly. The issue is that we would depend on the spreadsheet being manually updated... Opinions?

robbiehinch commented 2 years ago

My initial hubris is suffering a little discouragement. I hadn't anticipated the inconvenience of handling the incompatible text. One way forward could be to write some tests that help pinpoint the differences as quickly as possible and tack on some workarounds as a post-processing step. I'm quite pleased with the WordPress scraper at least. I'm a subscriber to the Airwindows Patreon, so I wonder if I could mail in a patch explaining the complications, and Chris might take it on as a style guide to make this a bit easier for us from the source. The spreadsheet is a great piece of work but it doesn't look continuously maintainable (at least for me).
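
A check of the kind suggested above could start from a simple name diff between the two sources. This is a sketch: the normalization rule (lower-case, strip everything except letters and digits) is an assumption about what should count as the same plugin, and the sample names on the site side are made up.

```javascript
// Sketch: report which plugin names appear in only one of the two sources.
function diffNames(txtNames, siteNames) {
  // Assumed normalization: ignore case, spaces and punctuation.
  const normalize = (name) => name.toLowerCase().replace(/[^a-z0-9]/g, "");
  const txtKeys = new Map(txtNames.map((n) => [normalize(n), n]));
  const siteKeys = new Map(siteNames.map((n) => [normalize(n), n]));
  return {
    onlyInTxt: [...txtKeys].filter(([key]) => !siteKeys.has(key)).map(([, n]) => n),
    onlyInSite: [...siteKeys].filter(([key]) => !txtKeys.has(key)).map(([, n]) => n),
  };
}

// Made-up sample: the first two names differ only in formatting.
const report = diffNames(
  ["Hard Vacuum", "NC-17", "OnlyInTxt"],
  ["hardvacuum", "nc17", "OnlyOnSite"]
);
console.log(report);
```

Run against the real lists, the two arrays in the report would give a short worklist of names that need a manual mapping or a fix at the source.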

hockinsk commented 2 years ago

Yeah, my spreadsheet was simply a quick thing to please myself, but I thought I should share it as it's useful. It's like the cheat sheet though: it needs manually updating. Although it's only a case of copy-pasting the new .txt from Chris, it's still more to do than you want to be doing. I think perhaps presenting Chris with airwindows.txt in a format he can maintain, one that lends itself to data presentation a bit better, might be the way to go, so we can develop something that just continually updates itself from what he enters. I get the feeling he likes things done how he does them though, but we can ask what he thinks?

ajboni commented 1 year ago

So, I went back to this and found the Airwindopedia txt is now MUCH cleaner and easier to parse. https://github.com/ajboni/airwindows-cheatsheet/blob/main/src/database.json

I also want to rebuild the frontend with Tailwind, so I will start a new site from scratch. Do you guys think that any of your work could help enhance this database, either the gdocs or the WordPress scraper, or is it maybe not worth the hassle/manual intervention required?

Any other ideas/feedback from user perspective are welcome!

ajboni / airwindows-cheatsheet-v1

Airwindopedia sync #29