Letractively / harvestman-crawler

Automatically exported from code.google.com/p/harvestman-crawler

Adding Flash-related file extension support #23

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
I would like to see .swf, .mxml, .as, .abc and .fla added to the extensions in netinfo.py (maybe in flash_extns) and to urlparser.py (is_flash()).

Original issue reported on code.google.com by fuk...@gmail.com on 22 Sep 2008 at 11:46

GoogleCodeExporter commented 8 years ago
Lukasz, could you please look at this? It could be simple to fix by adding the extensions in netinfo.py and adding a few functions, such as is_flash, to urlparser.py.

To reporter: Which are the Flash extensions? Isn't it only .swf? What about .mxml, .as, .abc, etc.?

I normally do this research myself, but believe me, I am asking only because I lack the time to do it all myself! :)

Thanks!

Original comment by abpil...@gmail.com on 6 Oct 2008 at 11:29

GoogleCodeExporter commented 8 years ago
- .fla is a source file for Adobe's own Flash IDE
- .mxml is Flash source code which can be compiled to Flex apps (see http://en.wikipedia.org/wiki/MXML)
- .as is ActionScript source code (see http://en.wikipedia.org/wiki/ActionScript)
- .abc is ActionScript byte code

Original comment by fuk...@gmail.com on 6 Oct 2008 at 11:39

GoogleCodeExporter commented 8 years ago
Thanks for the quick response :)

Original comment by abpil...@gmail.com on 6 Oct 2008 at 11:44

GoogleCodeExporter commented 8 years ago
I'll take a look at it.
We could probably add is_flash and is_script, if the latter is not there yet.

Do we have a sample config.xml that we could use to test whether these things get downloaded?

Thanks,
Lucas

Original comment by szybal...@gmail.com on 7 Oct 2008 at 2:58

GoogleCodeExporter commented 8 years ago
Is there any parsing that needs to be done on these files for URLs? What content do we expect the crawler to get out of them?

If we don't need to parse them and you just want to download them, I would propose adding them to document_extns.
http://code.google.com/p/harvestman-crawler/source/browse/trunk/HarvestMan/harvestman/lib/common/netinfo.py#65
If we don't want to download them as part of documents, then we could create an option for script_extns?

Let me know.
Lucas

Original comment by szybal...@gmail.com on 12 Oct 2008 at 5:17

GoogleCodeExporter commented 8 years ago
I will fix this, Lukasz. You can take care of any other stuff.

Original comment by abpil...@gmail.com on 12 Oct 2008 at 7:28

GoogleCodeExporter commented 8 years ago

Original comment by abpil...@gmail.com on 12 Oct 2008 at 7:28

GoogleCodeExporter commented 8 years ago
Fixed in revision 152. Added an is_flash method. Flash extensions are .swf, .fla, .mxml, .as and .abc. Added URL_TYPE_FLASH in urltypes.py as a subclass of URL_TYPE_MULTIMEDIA. Added unit tests in test_urltypes.py.
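
For readers without access to revision 152, here is a minimal sketch of what an extension-based check of this kind might look like. It is not the committed code: the names is_flash, URL_TYPE_FLASH and URL_TYPE_MULTIMEDIA come from this thread, but the constant FLASH_EXTNS, the class bodies and the function signature are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlparse

# Extensions named in this issue as Flash-related.
# FLASH_EXTNS is a hypothetical name, not necessarily the one in netinfo.py.
FLASH_EXTNS = ('.swf', '.fla', '.mxml', '.as', '.abc')


class URL_TYPE_MULTIMEDIA:
    """Hypothetical stand-in for the multimedia URL type in urltypes.py."""


class URL_TYPE_FLASH(URL_TYPE_MULTIMEDIA):
    """Flash URLs modelled as a special case of multimedia URLs."""


def is_flash(url):
    """Return True if the URL path ends in one of the Flash extensions."""
    # Look only at the path so query strings and fragments are ignored.
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    return ext in FLASH_EXTNS
```

With this sketch, is_flash("http://example.com/intro.swf") is true and is_flash("http://example.com/index.html") is false. Making the Flash type a subclass means any code that already handles multimedia URLs also accepts Flash URLs.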

Also added a <flash ...> element in config.xml as a download type control. It is disabled by default; to download Flash content, enable it. Tested on a site with Flash content to verify that this works.
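
The effect of such a download type control can be sketched roughly as follows. This is only an illustration of the "disabled by default" behaviour described above: the function should_download and its flash_enabled parameter are hypothetical and do not reflect how HarvestMan actually reads config.xml.

```python
import posixpath
from urllib.parse import urlparse

# Flash extensions listed in this issue.
FLASH_EXTNS = frozenset(['.swf', '.fla', '.mxml', '.as', '.abc'])


def should_download(url, flash_enabled=False):
    """Decide whether a URL passes the (hypothetical) Flash download control.

    flash_enabled stands in for the value of the flash setting in
    config.xml; it defaults to False, mirroring "disabled by default".
    """
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    if ext in FLASH_EXTNS:
        # Flash files are fetched only when the control is switched on.
        return flash_enabled
    # Non-Flash URLs are not affected by this control.
    return True
```

Under these assumptions, a .swf URL is skipped unless flash_enabled is set to True, while other URLs are left to the crawler's remaining rules.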

Lukasz, we don't need a separate is_script function for the time being. "Script" could mean many things, so for now everything comes under is_flash.

fukami, please test it and let me know if the fix is fine.

Marking as fixed. 

Original comment by abpil...@gmail.com on 12 Oct 2008 at 10:22