ZeroDot1 opened this issue 3 years ago
ZeroDot1 : Add a URL/Domain extractor.
ZeroDot1 : Add a search function
There exist tools for it already:
Most of the online tools are simply unusable because they do not support the extraction of entire domains with subdomains and long TLDs such as .stream. In addition, most tools are very limited, e.g. by size limits on input files.
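The limitation described above usually comes from extractors that hard-code a short TLD list or cap the number of subdomain labels. A minimal sketch of a pattern without that limit, assuming nothing about any particular tool (the function name is made up for illustration):

```python
import re

# Illustrative pattern: any number of subdomain labels, each up to 63
# characters, followed by a TLD of 2-63 letters, so "sub.example.stream"
# is matched in full. This is an extractor, not a validator: it will also
# match tokens that merely look like domains.
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}\b",
    re.IGNORECASE,
)

def find_domains(text):
    """Return every domain-looking token in the given text."""
    return DOMAIN_RE.findall(text)
```

Because the TLD part accepts any 2-63 letter label, long TLDs like `.stream` are handled without a lookup table.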
ZeroDot1 : Most of the online tools are simply unusable because they do not support the extraction of entire domains with subdomains and long TLDs such as .stream.
- this one looks good
Sorry, no, the tool does not work. I have tested it with different text inputs. From an HTML page with over 95 links, only 32 URLs/domains were extracted.
ZeroDot1 : Sorry, no, the tool does not work.
It does work very well, but it extracts only domains, not URLs; I just missed that you wanted to extract not only domains but also URLs.
Then there are several browser addons I use myself to extract domains/URLs from a webpage: https://addons.mozilla.org/pl/firefox/addon/web-link-extractor/ https://addons.mozilla.org/pl/firefox/addon/link-gopher/ https://chrome.google.com/webstore/detail/link-gopher/bpjdkodgnbfalgghnbeggfbfjpcfamkf https://chrome.google.com/webstore/detail/link-grabber/caodelkhipncidmoebgbbeemedohcdma?hl=pl
It does work very well,
All this is not what I would need. Your suggested tools are nothing more than online helpers. I need tools that work completely offline. I can't process 2GB text files with any of your suggestions.
ZeroDot1 : 2GB text files
Wow.
Are you sure you don't want to split the file into smaller chunks?: https://stackoverflow.com/questions/18208524/how-do-i-read-a-text-file-of-about-2-gb https://stackoverflow.com/questions/159521/text-editor-to-open-big-giant-huge-large-text-files
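For what it's worth, splitting may not even be necessary: Python iterates over a file lazily, so even a 2GB file can be scanned with roughly constant memory. A minimal sketch (the function name and arguments are made up for illustration):

```python
def count_matching_lines(path, needle):
    """Scan a text file of arbitrary size with constant memory usage."""
    count = 0
    # errors="replace" keeps the scan alive even on malformed bytes,
    # which matters for "completely mixed" input files.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:  # reads are buffered; the file is never fully in RAM
            if needle in line:
                count += 1
    return count
```

The same pattern (iterate, process, discard) works for extraction as well as counting.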
Also this is a good tool I use personally sometimes: https://www.digitalvolcano.co.uk/textcrawler.html
What does it do?
TextCrawler is a fantastic tool for anyone who works with text files. This powerful program enables you to instantly find and replace words and phrases across multiple files and folders. It utilises a flexible Regular Expression engine to enable you to create sophisticated searches, preview replace, perform batch operations, extract text from files and more. It is fast and easy to use, and as powerful as you need it to be.
I can't process 2GB text files with any of your suggestions.
:laughing: :rofl:
Try Linux :stuck_out_tongue_winking_eye:
This said, there is a Python module that can do this, since you don't have (e)grep available out of the box. My question to you (@ZeroDot1) is: are your source files in any kind of "standard" format, or would it be better to convert them first into some standard format, which could then be processed by PyFunceble?
Yeah, I'm curious as well, how did he end up with a 2GB text file...
I just can't imagine how a single 2GB text file could ever be created under normal conditions. I don't think a lot of people need this; it is rather a premium feature request needed by individuals for specific tasks, nothing that many people would benefit from. However, I'm always open-minded.
spirillen : Try Linux
Perhaps you were just joking; anyway, it seems he didn't want to rely on Linux tools alone:
ZeroDot1 : https://github.com/funilrys/PyFunceble/issues/234#issue-852389121 : (yes I know you can do that with Linux, it would just be very handy to be able to do everything with one program).
I use Linux most of the time. Yes, this is really a very special function, but I think this function will be very useful and helpful for me and for others.
The data are completely mixed files combined into one file. The function should simply be able to read all text independently of the file format, because it makes very little sense for an extraction function to be limited to specific file formats.
I think with just a few changes to PyFunceble the functions should be easily possible.
With PyFunceble it is already possible to read RAW files directly from the internet. If the function could be modified so that any file, or e.g. a website, could be given as the source, with only the URLs/domains being extracted, it would be very useful and helpful.
At the moment the function is simply limited to RAW files.
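For illustration, here is a rough standard-library sketch of what "any file or a website as source" could look like. The helper names are hypothetical and this is not PyFunceble's actual API:

```python
import re
from urllib.request import urlopen

# Rough URL pattern: scheme plus a run of characters that cannot appear in
# a URL. It over-matches on purpose; trailing punctuation is trimmed after.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")

def extract_urls(text):
    """Pull every http(s) URL-looking token out of arbitrary text."""
    return [u.rstrip(".,;)") for u in URL_RE.findall(text)]

def extract_urls_from_page(url):
    """Hypothetical helper: fetch a page and extract the URLs it contains."""
    with urlopen(url) as response:
        return extract_urls(response.read().decode("utf-8", errors="replace"))
```

The same `extract_urls` function works whether the text came from a local file or a downloaded page, which is the point of the request above.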
Try Linux
I have used Linux since 1999.
By the way, Linux was the first system I used. I might use Linux until one day I can look at the radishes from below :D
As for domains extraction:
As I said in https://github.com/funilrys/PyFunceble/issues/13#issuecomment-797590637 : in the case of extracting domains in the Adblock Decoder's "Decode everything" mode, it will produce too many useless false positives, which will clutter the output list and turn the output into a garbage dump.
There is a risk the same might happen with a 2GB mixed file, even if it doesn't contain Adblock Filter lists, or if you are lucky, it might not, but it depends on the content.
As for URLs extraction:
Can have false hits as well: https://pypi.org/project/urlextract/ https://mathiasbynens.be/demo/url-regex
The only solution seems to be to extract everything and leave all false hits/garbage as a user's issue to deal with. Paraphrasing WYSIWYG ==> WYGIWYG (What You Give is What You Get)
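The "extract everything, WYGIWYG" approach could be sketched as follows (all names are illustrative): one streaming pass over a file of any size, a combined URL-or-domain pattern, duplicates removed, and false hits deliberately left in for the user to clean up.

```python
import re

# URL alternative first so that domains inside URLs are consumed with the
# URL; bare domain-looking tokens are caught by the second alternative.
DOMAIN_OR_URL_RE = re.compile(
    r"""https?://[^\s"'<>]+"""
    r"""|(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}""",
    re.IGNORECASE,
)

def extract_everything(in_path, out_path):
    """One pass, constant memory; duplicates removed, hits sorted.
    False positives are intentionally kept -- cleaning them is the user's job."""
    hits = set()
    with open(in_path, encoding="utf-8", errors="replace") as src:
        for line in src:
            hits.update(DOMAIN_OR_URL_RE.findall(line))
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(sorted(hits)) + "\n")
```

Whether the resulting list is usable then depends entirely on the input content, exactly as discussed above.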
Try this offline tool https://www.softpedia.com/get/Office-tools/Other-Office-Tools/Web-Link-Extractor-Linas.shtml; I've tested it on easylist and:
Try this offline tool https://www.softpedia.com/get/Office-tools/Other-Office-Tools/Web-Link-Extractor-Linas.shtml,
I cannot use this tool; I use Linux as my system.
I don't have a VM at the moment because I don't have any free space and I can't buy a new SSD at the moment.
Somewhere in my archive I also have self-written software in C# to extract all possible URLs. I can't use that right now, but a solution that works directly with Linux and the command line is best. I think the best solution would be to use PyFunceble to extract domains and subdomains from completely mixed text.
@funilrys What do you think about this idea, would it be possible?
And what about Wine?
ZeroDot1 : I think the best solution would be to use PyFunceble to extract domains and subdomains from completely mixed text.
It can be done, but the result will contain many false hits / rubbish; you will have to waste time by:
@ZeroDot1 wrote:
Try Linux
I have used Linux since 1999.
By the way, Linux was the first system I used. I might use Linux until one day I can look at the radishes from below :D
I know, I must have been tired as I confused you with someone else :smiley: :sleepy: You are on Arch, I know...
@ZeroDot1 wrote: With PyFunceble it is already possible to read RAW files directly from the internet, if the function could be modified so that any file or e.g. a website is entered as source and simply only URLs/domains are extracted it would be very useful and helpful.
Sounds like an integration of BeautifulSoup could come in handy!!!
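BeautifulSoup would indeed fit the HTML case; for completeness, even the standard library can do the basic part. A sketch with `html.parser` (the class and function names are made up for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href/src attribute values: a stdlib stand-in for what a
    BeautifulSoup-based extractor would do on HTML input."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links
```

As keczuppp notes below, this only helps for HTML/XML sources, not for arbitrary mixed text.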
This said, I do understand why you (@ZeroDot1) would like to integrate it into @pyfunceble directly.
This is not an objection, but a big-picture thought: would it be handier to write this as an individual tool that can extract all URLs/domains from any source? Why: I could use such a tool to extract URLs/domains when I'm working on the Adult Contents project, saving me a bunch of time extracting sources through @gorhill's uBlock Origin logger. This tool could then use the @pyfunceble API to test on the fly.
What do you others think? @ZeroDot1 @keczuppp @mitchellkrogza @funilrys
@keczuppp wrote: It can be done, but the result will contain many false hits / rubbish; you will have to waste time by:
Not true if you are using proper code bases
spirillen : Sounds like an integration of BeautifulSoup could come in handy!!!
I saw it before, but I didn't mention it because it supports only HTML or XML, which is not what the OP requested; he requested extraction from any text.
keczuppp wrote: It can be done, but the result will contain many false hits / rubbish; you will have to waste time by:
spirillen : Not true if you are using proper code bases
I already mentioned that before (I wrote "might / can" instead of "will"):
keczuppp : https://github.com/funilrys/PyFunceble/issues/234#issuecomment-817182667 : There is a risk the same might happen with a 2GB mixed file, even if it doesn't contain Adblock Filter lists, or if you are lucky, it might not, but it depends on the content.
keczuppp : https://github.com/funilrys/PyFunceble/issues/234#issuecomment-817182667 : Can have false hits as well
You quoted my statement out of context. In my comment https://github.com/funilrys/PyFunceble/issues/234#issuecomment-818863586, my statement sits below a quote to which it refers, which means I was referring to the quote and not talking generally. The quote says: "ZeroDot1 : completely mixed text.", where "mixed" most likely means "random", which is the opposite of "spirillen: use proper (prepared/custom) code base". Hence I can't agree with you saying "not true" in this case. But yeah, when not referring to the quote and talking generally, it is possible to avoid false hits if you cherry-pick the input content (which I already mentioned before).
spirillen : What do you others think?
Maybe it can be like the Adblock Decoder; it can exist in both forms: integrated into PyFunceble and as a standalone tool.
Hey @ZeroDot1
As I'm re-reading your suggestion, I'm sitting here thinking: what would you prefer?
Related to: Adult Contents submit program #python, New url location: https://mypdns.org/my-privacy-dns/porn-records/-/issues/59
The search "function" won't be my priority. Other tools should be able to handle it better.
But some dedicated tools which proxies our internal decoders (like the adblock-decoder) may be provided in the future.
Let's keep this open.
btw, the PyFunceble Web Worker project provides some endpoints for the decoding/conversion of inputs.
It basically exposes the (internal) converters of PyFunceble behind a web server / API. I can't and don't want to host such a service (yet), but it can be a good alternative for some people ... I'm still ready to fix the issues reported there, though.
Add a URL/Domain extractor. With this function it should be possible to extract all URLs/domains from any text and save them to a file, so that they can easily be checked later without significant time and effort. Simply a useful function for blacklist developers.
Add a search function (yes I know you can do that with Linux, it would just be very handy to be able to do everything with one program). With the search function it should be possible to search a file for all URLs/domains containing a given string and save them to another file, and it should be possible to search for multiple keywords at once, separated by commas. This function is also very helpful for blacklist developers.
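The requested search function could be sketched like this (the names are illustrative; this is not an existing PyFunceble feature):

```python
def search_entries(lines, keywords):
    """Keep every URL/domain line containing at least one of the
    comma-separated keywords, as the request above describes."""
    wanted = [k.strip().lower() for k in keywords.split(",") if k.strip()]
    return [line for line in lines
            if any(k in line.lower() for k in wanted)]
```

Fed line by line from a file (as in the streaming examples earlier in the thread), this also works on very large inputs, and the matching lines can simply be written to a second file.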