compbiocore / VariantVisualization.jl

Julia package powering VIVA, our tool for visualization of genomic variation data. Manual:
https://compbiocore.github.io/VariantVisualization.jl/stable/

ERROR: LoadError: ArgumentError: column name :0.0 not found in the data frame #91

Closed mjmontague closed 4 years ago

mjmontague commented 4 years ago

I'm running the following command:

docker run -it --rm -v "$PWD":/data compbiocore/viva-cli viva --save_remotely -f SFARI_priority.vcf -t Grouped_by_Sequencing_Site -g Book9.csv seq_site_1,seq_site_2 -o output_3

I imagine the error involves the vcf, but I can't figure it out. Does the order of the sample names in the vcf need to match the order of the sample names in the metadata file?

Here's a snippet of my vcf:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 01K 03J 04T

1 91127495 . T A

And the first 3x3 of my metadata csv:

id,03J,04T,
"seq_site_1,seq_site_2",2,2
"case,control",1,2,

Here's the full error message:

Finished loading packages!

Reading SFARI_priority.vcf ...

No filters applied. Large vcf files will take a long time to process and heatmap visualizations will lose resolution at this scale unless viewed in interactive html for zooming.

Loading VCF file into memory for visualization

Selected 84 variants with no filters applied

┌ Warning: DataFrame(t::Type, nrows::Integer, ncols::Integer) is deprecated, use DataFrame([Vector{t}(undef, nrows) for i = 1:ncols]) instead.
│ caller = generate_genotype_array(::Array{Any,1}, ::String) at vcf_utils_complete.jl:664
└ @ VariantVisualization ~/.julia/packages/VariantVisualization/1yoNl/src/vcf_utils_complete.jl:664

Grouping samples by seq_site_1,seq_site_2

ERROR: LoadError: ArgumentError: column name :0.0 not found in the data frame
Stacktrace:
 [1] lookupname at /root/.julia/packages/DataFrames/VrZOl/src/other/index.jl:233 [inlined]
 [2] (::getfield(DataFrames, Symbol("##25#26")){DataFrames.Index})(::Symbol) at ./none:0
 [3] iterate at ./generator.jl:47 [inlined]
 [4] collect_to! at ./array.jl:651 [inlined]
 [5] collect_to_with_first!(::Array{Int64,1}, ::Int64, ::Base.Generator{Array{Symbol,1},getfield(DataFrames, Symbol("##25#26")){DataFrames.Index}}, ::Int64) at ./array.jl:630
 [6] collect(::Base.Generator{Array{Symbol,1},getfield(DataFrames, Symbol("##25#26")){DataFrames.Index}}) at ./array.jl:611
 [7] getindex at /root/.julia/packages/DataFrames/VrZOl/src/other/index.jl:245 [inlined]
 [8] #select#122 at /root/.julia/packages/DataFrames/VrZOl/src/dataframe/dataframe.jl:825 [inlined]
 [9] #select at ./none:0 [inlined]
 [10] getindex(::DataFrames.DataFrame, ::Colon, ::Array{Symbol,1}) at /root/.julia/packages/DataFrames/VrZOl/src/dataframe/dataframe.jl:401
 [11] sortcols_by_phenotype_matrix(::String, ::String, ::Array{Int64,2}, ::Array{Symbol,2}) at /root/.julia/packages/VariantVisualization/1yoNl/src/vcf_utils_complete.jl:858
 [12] top-level scope at /usr/local/bin/viva:444
 [13] include at ./boot.jl:326 [inlined]
 [14] include_relative(::Module, ::String) at ./loading.jl:1038
 [15] include(::Module, ::String) at ./sysimg.jl:29
 [16] exec_options(::Base.JLOptions) at ./client.jl:267
 [17] _start() at ./client.jl:436
in expression starting at /usr/local/bin/viva:407

gtollefson commented 4 years ago

@mjmontague Thanks for posting all of the relevant info. Let's get this sorted!

The VCF looks properly formatted. The error suggests there is a sample named ":0.0" or "0.0" in either the VCF or the metadata file, but not both. Do you see this sample?

Sample ids in the metadata file must match those found in the VCF file, but do not need to be in the same order as they appear in the VCF header.
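
A quick way to check this from the shell is sketched below. It assumes the VCF is tab-delimited with sample IDs from column 10 of the #CHROM line onward, and that the metadata CSV lists sample IDs across its first row after a leading "id" field, as in the snippet above. The function names are made up for illustration and are not part of VIVA.

```shell
# Hypothetical helper: print the sample IDs from the #CHROM header line of a VCF,
# one per line, sorted.
vcf_samples() {
  grep -m1 '^#CHROM' "$1" | cut -f10- | tr '\t' '\n' | sort
}

# Hypothetical helper: print the sample IDs from the first row of the metadata CSV,
# skipping the leading "id" field and any empty trailing field.
csv_samples() {
  head -n 1 "$1" | tr -d '\r' | tr ',' '\n' | tail -n +2 | grep -v '^$' | sort
}

# Print IDs that appear in only one of the two files.
# Empty output means both files name the same samples (order is ignored).
mismatched_samples() {
  vcf_tmp=$(mktemp)
  csv_tmp=$(mktemp)
  vcf_samples "$1" > "$vcf_tmp"
  csv_samples "$2" > "$csv_tmp"
  comm -3 "$vcf_tmp" "$csv_tmp" | tr -d '\t'
  rm -f "$vcf_tmp" "$csv_tmp"
}

# Example with the file names from this thread:
# mismatched_samples SFARI_priority.vcf Book9.csv
```

This only compares the ID strings as text, so it will not reveal IDs that a CSV parser later reinterprets as numbers.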

I'll keep an eye out for your response. -George

mjmontague commented 4 years ago

Thanks for the quick reply George -

That was my assumption, too, but I don't see a sample named ":0.0" or "0.0" in the VCF or metadata. I have a sample with ID "K00" but get the same error if I change the ID to "K01" in both files.

Troubleshooting thoughts:

  1. The VCF is for rhesus macaque variants, but the chromosomes are numbered 1-20.
  2. There are a large number of commented-out rows in the VCF, but I don't think those would interfere with the processes.
  3. Since the file is small, I ran it through vcftools using the --recode option to see if a recoded vcf would work, but no luck.

I'll be happy to share the files with you if you think that would help

gtollefson commented 4 years ago

No problem. We're happy that you're using our tool!

Ok, that's interesting. Do you get the same error when running the first ~500 lines of the VCF file with something like head -n 500 my.VCF > test.VCF? Can you try that and if it replicates the error, can you send the truncated VCF file and the complete metadata file?

mjmontague commented 4 years ago

The VCF file is 3084 lines but I only have 84 variants in my test file. Interestingly, the following command works with the unaltered VCF and I get the HTML output, Read_Depth_SFARI_priority.vcf.html and Genotype_SFARI_priority.vcf.html:

docker run -it --rm -v "$PWD":/data compbiocore/viva-cli viva --save_remotely -f SFARI_priority.vcf -o output_2

Most of the rows in the VCF are contig header lines, including contigs whose variants were removed:

contig=

contig=

If I remove these extraneous rows manually in BBEdit and re-run this command:

docker run -it --rm -v "$PWD":/data compbiocore/viva-cli viva --save_remotely -f SFARI_priority.vcf -t Grouped_by_Sequencing_Site -g Book9.csv seq_site_1,seq_site_2 -o output_3

then I see this error (where L85 is the first row for the first variant in the VCF):

Loading VCF file into memory for visualization

ERROR: LoadError: incomplete GeneticVariation.VCF.Reader input on line 85
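
For comparison with the manual edit, here is a minimal sketch of dropping only the contig header lines on the command line, which leaves every other header line (including the #CHROM column header) and the line endings untouched. It assumes the contig lines start with "##contig=" as in the VCF specification. The function name and output file name are hypothetical, and nothing in this thread confirms whether viva accepts a VCF without contig lines.

```shell
# Hypothetical helper: copy a VCF to standard output without its ##contig header lines.
# All other lines, including the #CHROM column header and the records, are kept as-is.
strip_contigs() {
  grep -v '^##contig=' "$1"
}

# Example (output file name is made up for illustration):
# strip_contigs SFARI_priority.vcf > SFARI_priority.nocontig.vcf
```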

mjmontague commented 4 years ago

As expected, the test.vcf file is incomplete:

ERROR: LoadError: incomplete GeneticVariation.VCF.Reader input on line 501

I created a test vcf with the first two variants using the following command:

head -n 3002 SFARI_priority.vcf > test.VCF

And I see the same error. I'm attaching the original files here: mjmontague.zip
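
Since most of this file is header, cutting by line count can stop before the first variant. A rough sketch of an alternative that keeps every header line plus the first N variant records is below. The function name is made up for illustration, and it assumes header lines are exactly the lines starting with "#".

```shell
# Hypothetical helper: print all header lines (those starting with '#')
# plus the first N variant records of a VCF.
#   $1 = path to the VCF, $2 = number of records to keep
vcf_head() {
  awk -v n="$2" '/^#/ { print; next } seen < n { print; seen++ }' "$1"
}

# Example with the file name from this thread:
# vcf_head SFARI_priority.vcf 2 > test.vcf
```

The result has the same header as the original regardless of how many contig lines it contains.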

mjmontague commented 4 years ago

FYI: There's also a sample with ID "0E8"

mjmontague commented 4 years ago

I think that's it. When I change ID 0E8 to A0E8, then I get the following error:

ERROR: LoadError: ArgumentError: column name :1000.0 not found in the data frame

Of course, I also have sample 1E3 in the dataset. Is there a way to avoid this? I prefer not to change sample IDs. IDs that look like scientific notation are always a problem in Excel, but I didn't expect them to be an issue when importing a raw CSV into other tools.

mjmontague commented 4 years ago

Update: I altered sample IDs for 0E8 and 1E3 in both the VCF and metadata CSV, and the tool runs successfully. Woohoo! I'll consider re-naming the samples in the original VCF to avoid similar dilemmas in the future.
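
For anyone checking their own sample IDs in advance, a minimal sketch follows. It flags IDs that a number parser could read as scientific notation, such as 0E8 (0 × 10^8, which matches the :0.0 in the first error) and 1E3 (1 × 10^3, which matches the :1000.0 in the second). The function name is hypothetical, and the pattern is my guess at what counts as number-like; the exact rules the CSV reader applies are not confirmed here.

```shell
# Hypothetical helper: read one sample ID per line on standard input and print
# the IDs that look like numbers in scientific notation (e.g. 0E8, 1E3, 2.5e-4).
sci_like_ids() {
  grep -E '^[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)[eE][+-]?[0-9]+$'
}

# Example with the file name from this thread (assumes a tab-delimited VCF):
# grep -m1 '^#CHROM' SFARI_priority.vcf | cut -f10- | tr '\t' '\n' | sci_like_ids
```

IDs such as K00 or A0E8 pass through unflagged because they do not start with a digit, sign, or decimal point.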

gtollefson commented 4 years ago

@mjmontague I'm glad you were able to run it. It's interesting that the CSV.read function automatically converts scientific notation to standard form... I'm considering a fix to avoid converting the input VCF to standard form so this doesn't happen in the future.

Please let me know if you have any more questions, and don't forget to cite us if you use the visualizations in a publication. I hope your analysis is going well!