NICTA / scoobi

A Scala productivity framework for Hadoop.
http://nicta.github.com/scoobi/

Prevent extra delimiter on sequence types in listToDelimitedTextFile #311

Closed bsidhom closed 10 years ago

bsidhom commented 10 years ago

When a DList of certain sequence types is persisted using list.toDelimitedTextFile, an additional delimiter is appended to the output. For example:

val list: DList[List[Int]] = DList(List(1, 2, 3), List(4, 5))
list.toDelimitedTextFile("output.csv", "|").persist

results in an "output.csv" file consisting of

1|2|3|
4|5|

This happens because anyToString relies on productIterator for every Product, instead of mapping over the elements directly and calling mkString when the container is already a sequence.

A possible workaround is to add a check at the beginning of anyToString to see whether the value is a sequence type. For example:

def anyToString(any: Any, sep: String): String = any match {
  // Seq must be matched before Product, because List is both
  case seq: Seq[_] => seq.map(anyToString(_, sep)).mkString(sep)
  case prod: Product => prod.productIterator.map(anyToString(_, sep)).mkString(sep)
  case _ => any.toString
}

Other types may need special cases as well. This change also makes nested sequence types that don't extend Product, such as Vector, expand recursively (a plain Array is not a Seq at runtime, so it would need its own case). If the outermost container is an array, it still won't be expanded, because listToDelimitedTextFile only accepts products. Because of this, it may or may not make sense to make anyToString aware of other types, or even to address the issue with lists.
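To check the proposal against the cases discussed here, a small self-contained sketch in plain Scala (no Scoobi dependency) could look like the following. The object name DelimitedSketch and the explicit Array case are assumptions added for illustration; they are not part of Scoobi or of the proposal above.

```scala
object DelimitedSketch {
  // Sketch of the proposed anyToString, with one extra (assumed) case for arrays.
  def anyToString(any: Any, sep: String): String = any match {
    // Seq comes first: a List is also a Product and would otherwise hit the Product case.
    case seq: Seq[_] => seq.map(anyToString(_, sep)).mkString(sep)
    // A plain Array is not a Seq at runtime, so it gets its own case (assumption, see lead-in).
    case arr: Array[_] => arr.iterator.map(anyToString(_, sep)).mkString(sep)
    // Tuples and case classes are flattened field by field.
    case prod: Product => prod.productIterator.map(anyToString(_, sep)).mkString(sep)
    case _ => any.toString
  }

  def main(args: Array[String]): Unit = {
    println(anyToString(List(1, 2, 3), "|"))            // 1|2|3
    println(anyToString((1, List(2, 3), "a"), "|"))     // 1|2|3|a
    println(anyToString(List(List(1, 2), List(3)), "|")) // 1|2|3
    println(anyToString(Array(4, 5), "|"))              // 4|5
  }
}
```

Note that nested containers are flattened into the same row, so the column count of a line depends on the lengths of any inner sequences.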

etorreborre commented 10 years ago

I think this issue is the result of List being treated as a special product of size 2, where the first element is the list head and the second element is the list tail:

> List(1, 2, 3, 4).productArity
> 2
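The trailing delimiter follows from this head/tail structure, as the sketch below tries to show. It assumes a product-only conversion of roughly this shape; productOnlyToString and ProductUnrolling are hypothetical names used for illustration, not Scoobi's actual source. Each non-empty list is a `::` cell with two fields, and the recursion ends at Nil, which is itself a Product with no fields and therefore renders as an empty string after the last separator.

```scala
object ProductUnrolling {
  // Hypothetical stand-in for a conversion that only knows about Product.
  def productOnlyToString(any: Any, sep: String): String = any match {
    case prod: Product => prod.productIterator.map(productOnlyToString(_, sep)).mkString(sep)
    case _ => any.toString
  }

  def main(args: Array[String]): Unit = {
    val xs = List(1, 2, 3)
    // A non-empty List is a `::` cell: a case class holding head and tail.
    println(xs.productArity)           // 2
    println(xs.productIterator.toList) // List(1, List(2, 3))
    // Nil is a case object, so it is a Product with zero fields.
    println(Nil.productArity)          // 0
    // Unrolls as "1" | ("2" | ("3" | "")), which leaves a separator at the end.
    println(productOnlyToString(xs, "|")) // 1|2|3|
  }
}
```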