Closed GoogleCodeExporter closed 9 years ago
Lukasz, could you please look at this? It could be simple to fix by adding
extensions in netinfo.py and adding a few functions to urlparser.py (is_flash
etc.).
To reporter: Which are the Flash extensions? Isn't it only .swf?
What about .mxml, .as, .abc etc.?
I normally do this research myself, but believe me, I am asking only because I
lack the time to do all of it myself! :)
Thanks!
Original comment by abpil...@gmail.com
on 6 Oct 2008 at 11:29
- .fla is a source file for Adobe's own Flash IDE
- .mxml is Flash source code which can be compiled to Flex apps (see
http://en.wikipedia.org/wiki/MXML)
- .as is ActionScript source code (see
http://en.wikipedia.org/wiki/ActionScript)
- .abc is ActionScript byte code
Original comment by fuk...@gmail.com
on 6 Oct 2008 at 11:39
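A minimal sketch of the extension check being discussed, for illustration only. The constant name FLASH_EXTNS and the standalone function below are assumptions; the real code in netinfo.py and urlparser.py may be organised differently. The extension list is the one given in this thread plus .swf.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical constant: extensions named in this thread, plus .swf
FLASH_EXTNS = ('.swf', '.fla', '.mxml', '.as', '.abc')

def is_flash(url):
    """Return True if the URL path ends in a known Flash-related extension."""
    # Use only the path component so query strings and fragments are ignored
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    return ext in FLASH_EXTNS
```

Matching on the lower-cased extension of the parsed path keeps URLs such as movie.SWF?x=1 from slipping through.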
Thanks for the quick response :)
Original comment by abpil...@gmail.com
on 6 Oct 2008 at 11:44
I'll take a look at it.
We could probably add is_flash and is_script, if the latter is not there yet.
Do we have a sample config.xml that we could use to test whether these things
get downloaded?
Thanks,
Lucas
Original comment by szybal...@gmail.com
on 7 Oct 2008 at 2:58
Is there any parsing that needs to be done on these files for URLs? What
content do we expect the crawler to get out of them?
If we don't need to parse them and you just want to download them, I would
propose adding them to document_extns:
http://code.google.com/p/harvestman-crawler/source/browse/trunk/HarvestMan/harvestman/lib/common/netinfo.py#65
If we don't want to download them as part of documents, then we could create
an option for script_extns?
Let me know.
Lucas
Original comment by szybal...@gmail.com
on 12 Oct 2008 at 5:17
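To make the two options above concrete, here is a rough sketch of grouping extensions so that each group can be switched on or off separately. The names EXTN_GROUPS, classify and wanted are hypothetical, and the 'document' entries are placeholders, not the actual contents of document_extns in netinfo.py.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical grouping; the 'document' entries are illustrative placeholders
EXTN_GROUPS = {
    'document': {'.doc', '.pdf'},
    'script': {'.fla', '.mxml', '.as', '.abc'},
}

def classify(url):
    """Return the name of the extension group for a URL, or None."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    for group, extns in EXTN_GROUPS.items():
        if ext in extns:
            return group
    return None

def wanted(url, enabled_groups):
    """Return True if the URL belongs to one of the enabled groups."""
    return classify(url) in enabled_groups
```

With a separate 'script' group, these files can be skipped even when documents are being downloaded, for example wanted(url, {'document'}).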
I will fix this Lukasz. You can take care of any other stuff.
Original comment by abpil...@gmail.com
on 12 Oct 2008 at 7:28
Fixed in revision 152. Added an is_flash method. Flash extensions are .swf, .fla,
.mxml, .as and .abc. Added URL_TYPE_FLASH in urltypes.py as a subclass of
URL_TYPE_MULTIMEDIA. Added unit tests in test_urltypes.py.
Also added a <flash ...> element in config.xml as a download type control. It is
disabled by default; to download Flash, enable it. Tested it on a site with
Flash content to verify that this works.
Lukasz, we don't need a separate function is_script for the time being. Script
could mean many things, so for now everything comes under is_flash.
fukami, please test it and let me know if the fix is fine.
Marking as fixed.
Original comment by abpil...@gmail.com
on 12 Oct 2008 at 10:22
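For readers without the repository at hand, the shape of this fix can be sketched roughly as follows. Plain Python classes stand in for the definitions in urltypes.py, and URL_TYPE_ANY, should_download and the 'flash' dictionary key are assumptions made for illustration; only the names URL_TYPE_FLASH and URL_TYPE_MULTIMEDIA and the disabled-by-default behaviour come from the comment above.

```python
class URL_TYPE_ANY:
    """Stand-in base type (assumed; not taken from urltypes.py)."""

class URL_TYPE_MULTIMEDIA(URL_TYPE_ANY):
    """Multimedia content."""

class URL_TYPE_FLASH(URL_TYPE_MULTIMEDIA):
    """Flash content, modelled as a kind of multimedia."""

def should_download(url_type, config):
    """Decide whether a URL of the given type should be fetched.

    `config` is a plain dict standing in for the parsed config.xml;
    the 'flash' key is hypothetical and defaults to disabled.
    """
    if issubclass(url_type, URL_TYPE_FLASH):
        return bool(config.get('flash', False))
    return True
```

Because the Flash type derives from the multimedia type, any existing check written against URL_TYPE_MULTIMEDIA still matches Flash URLs, while the flag gates the download itself.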
Original issue reported on code.google.com by
fuk...@gmail.com
on 22 Sep 2008 at 11:46